Implement Thread caches for pooled buffers #2284

Closed

normanmaurer wants to merge 1 commit into base 4.0 from pool_caches

Conversation

5 participants
normanmaurer (Member) commented Mar 3, 2014

No description provided.

normanmaurer (Member) commented Mar 3, 2014

Please note this is still a work in progress; I just opened the PR to make it easy to follow and review for now. It should be complete tomorrow.

normanmaurer added this to the 4.0.18.Final milestone Mar 3, 2014

ghost commented Mar 3, 2014

Build result for #2284 at 03de264: Failure

ghost commented Mar 3, 2014

Build result for #2284 at 7fa5646: Failure

ghost commented Mar 3, 2014

Build result for #2284 at c529851: Success

ghost commented Mar 4, 2014

Build result for #2284 at 395976c: Failure

ghost commented Mar 4, 2014

Build result for #2284 at 3cc9007: Success

ghost commented Mar 4, 2014

Build result for #2284 at 32956f0: Success

ghost commented Mar 4, 2014

Build result for #2284 at 9142a36a73cc891006773ee323ee2e68fc79e34d: Success

normanmaurer (Member) commented Mar 4, 2014

Here are some stats:

No caches (before this PR):

[nmaurer@xxx]~% wrk/wrk  -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256  http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    24.28ms   17.85ms 288.40ms   89.91%
    Req/Sec   176.91k    36.44k  291.96k    70.07%
  337467919 requests in 2.00m, 45.57GB read
Requests/sec: 2812559.99
Transfer/sec:    388.93MB

[Screenshot from 2014-03-04, 13:25:23]

Using caches (with this PR included!):

[nmaurer@xxx]~% wrk/wrk  -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256  http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.88ms   18.34ms 327.94ms   90.75%
    Req/Sec   190.52k    42.95k  323.34k    70.94%
  362746673 requests in 2.00m, 48.99GB read
Requests/sec: 3022942.17
Transfer/sec:    418.02MB

[Screenshot from 2014-03-04, 13:25:43]
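
(For comparison: that is roughly a 7.5% throughput improvement over the run without caches, 3,022,942 vs. 2,812,560 requests/sec.)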

normanmaurer (Member) commented Mar 4, 2014

@trustin @daschl please review

ghost commented Mar 4, 2014

Build result for #2284 at aa4dc6d: Success

blucas (Contributor) commented Mar 4, 2014

I know I'm being picky, but the benchmark you ran WITH the PR did not use the same command-line arguments. It would have been nice to compare results from runs with identical arguments :)

normanmaurer (Member) commented Mar 4, 2014

@blucas oops... I pasted the wrong benchmark output. Just fixed it; the screenshots were already correct. Thanks for catching this!

trustin (Member) commented Mar 4, 2014

And what is the perf difference when we have enough arenas configured?

normanmaurer (Member) commented Mar 4, 2014

@trustin no difference at all with this benchmark (using arenas == event loops).

ghost commented Mar 5, 2014

Build result for #2284 at 53035fd: Success

normanmaurer (Member) commented Mar 5, 2014

@trustin to make you happy I also ran the benchmark again without this PR, but with "-Dio.netty.allocator.numDirectArenas=32 -Dio.netty.allocator.numHeapArenas=32" to have the same number of arenas as IO threads.

[nmaurer@xxx]~% wrk/wrk  -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256  http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.63ms   18.34ms 370.89ms   90.57%
    Req/Sec   184.45k    42.09k  333.07k    70.99%
  351066280 requests in 2.00m, 47.41GB read
Requests/sec: 2925870.47
Transfer/sec:    404.60MB

So my statement before was incorrect: using the same number of arenas is still a bit slower than using the caches. I ran this benchmark a few times and can reproduce it every time. I also re-ran the other benchmarks, and their numbers are the same as before.

ghost commented Mar 5, 2014

Build result for #2284 at 7c22edd: Success

ghost commented Mar 5, 2014

Build result for #2284 at 68a52f6: Success

normanmaurer (Member) commented Mar 7, 2014

@trustin any chance you could review so we can merge it? ;) Maybe once you have your jet lag sorted :)

ghost commented Mar 13, 2014

Build result for #2284 at 5ae85b8: Success

final int chunkSize;
final int subpageOverflowMask;
final int numTinySubpagePools;
final int numSmallSubpagePools;

trustin (Member), Mar 13, 2014:

num(Tiny|Small)SubpagePools could be static final.

trustin (Member), Mar 13, 2014:

Actually, numTinySubpagePools only.

normanmaurer (Member), Mar 13, 2014:

true... Let me fix it


// Check if the cache was tried before which would be the case if it was a tiny or small allocation request.
// In that case there is no need to try again with cached normal allocation.
if (!triedCache) {

trustin (Member), Mar 13, 2014:

You don't need triedCache at all. You can just do this:

if (isTinyOrSmall(...)) {
  ...
} else if (normCapacity <= chunkSize) {
  if (!cache.allocateNormal(...)) {
    allocateNormal(buf, reqCapacity, normCapacity);
  }
} else { // normCapacity > chunkSize
  allocateHuge(buf, reqCapacity);
}

normanmaurer (Member), Mar 13, 2014:

Good point... let me fix it :)

normanmaurer (Member), Mar 13, 2014:

Actually, after thinking about it, this is not really the same: at the moment (and before) it was possible to still hit allocateNormal if (s == head).

trustin (Member), Mar 13, 2014:

Then you can call allocateNormal() there, too. E.g.:

if (isTinyOrSmall(...)) {
  ...
  allocateNormal(buf, reqCapacity, normCapacity);
} else if (normCapacity <= chunkSize) {
  if (!cache.allocateNormal(...)) {
    allocateNormal(buf, reqCapacity, normCapacity);
  }
} else { // normCapacity > chunkSize
  allocateHuge(buf, reqCapacity);
}

trustin (Member), Mar 13, 2014:

Alternatively:

if (isTinyOrSmall(...)) {
  ...
} else if (normCapacity <= chunkSize) {
  if (cache.allocateNormal(...)) {
    return;
  }
} else { // normCapacity > chunkSize
  allocateHuge(buf, reqCapacity);
  return;
}

allocateNormal(buf, reqCapacity, normCapacity);

// 32 kb is the maximum of size which is cached. Similar to what is explained in
// 'Scalable memory allocation using jemalloc'
DEFAULT_MAX_CACHE_BUFFER_SIZE = SystemPropertyUtil.getInt("io.netty.allocator.maxCacheBufferSize", 32 * 1024);

trustin (Member), Mar 13, 2014:

'the maximum of size which is cached' -> 'the default maximum capacity of the cached buffer'

trustin (Member), Mar 13, 2014:

io.netty.allocator.maxCacheBufferSize -> io.netty.allocator.maxCachedBufferCapacity

trustin (Member), Mar 13, 2014:

MAX_CACHE_BUFFER_SIZE -> MAX_CACHED_BUFFER_CAPACITY
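
Taken together, these three suggestions would turn the snippet above into something like the following sketch (the Javadoc wording and the 32 KiB default are carried over from the original; the final committed form may differ slightly):

// Default maximum capacity of the cached buffer. Similar to what is explained in
// 'Scalable memory allocation using jemalloc'.
DEFAULT_MAX_CACHED_BUFFER_CAPACITY =
        SystemPropertyUtil.getInt("io.netty.allocator.maxCachedBufferCapacity", 32 * 1024);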

PoolThreadCache(PoolArena<byte[]> heapArena, PoolArena<ByteBuffer> directArena) {
PoolThreadCache(PoolArena<byte[]> heapArena, PoolArena<ByteBuffer> directArena,
int tinyCacheSize, int smallCacheSize, int normalCacheSize,
int maxCacheSize, int maxCacheArraySize) {

trustin (Member), Mar 13, 2014:

maxCacheSize -> maxCachedBufferCapacity

trustin (Member), Mar 13, 2014:

What is maxCacheArraySize? Could you elaborate?

normanmaurer (Member), Mar 13, 2014:

The name is wrong, I guess... It is the max number of entries in the array for normal allocations.

}

// TODO: Find a better name
private static int normalCacheIdx(int val) {

trustin (Member), Mar 13, 2014:

log2()?
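
For a power-of-two capacity, that index boils down to a base-2 logarithm; a minimal sketch of what the renamed helper might look like (my illustration, not necessarily the committed code):

// log2 of a power-of-two value, e.g. log2(8192) == 13.
private static int log2(int val) {
    return 31 - Integer.numberOfLeadingZeros(val);
}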


// Used for bitshifting when calculate the index of normal caches later
private final int valNormalDirect;
private final int valNormalHeap;

trustin (Member), Mar 13, 2014:

val -> numShifts (if I guessed correctly)

}

/**
* Cache used for buffers which are backed by TINY or SMALL size.

normanmaurer (Member), Mar 13, 2014:

I just noticed this needs to go on SubPagePoolChunkCache, and the one on SubPagePoolChunkCache here ...

/**
* Cache used for buffers which are backed by NORMAL size.
*
* @param <T>

trustin (Member), Mar 13, 2014:

Remove @param if you are not going to document it.

trustin (Member), Mar 13, 2014:

Same for the other @param tags in this pull request.

trustin (Member) commented Mar 13, 2014

> I don't think the counter can be thread-local, as the whole point is to free up stuff from the cache if the Thread stops allocating out of it frequently enough. If we make it thread-local only and only check during the allocation path, we will never free anything.

Well, I see no problem. We always consult the thread-local cache before allocating from the arena, so we don't have to increase the counter only when something was allocated via the thread-local cache; we can also increase it whenever an allocation attempt is made on the thread-local cache.

normanmaurer (Member) commented Mar 13, 2014

@trustin I was actually trying to guard against both cases. This is why I chose the interval-like solution... But maybe it would be good enough to just count the allocations per cache and do the cleanup every X allocations, just like you propose... I only wanted to make sure you understand why I did it and what my concerns are ;)

trustin (Member) commented Mar 13, 2014

Yeah, I see where you are coming from again. One case I can think of right now is when an event loop goes completely idle (i.e. no traffic) after sudden large traffic. It's not very likely in modern services, but yeah, it can happen.

How about exposing some public static methods for better control, such as:

  • One that allows a user to trigger an incremental clean-up on the current thread-local cache.
  • One that tells if the current thread has a thread-local cache.

Then we can always add periodic invocation of the methods mentioned above if this becomes a common problem. Otherwise, only the users who have the problem could call them, until we find the best way to implement periodic incremental clean-up elegantly. WDYT?
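
A rough sketch of the kind of hooks being proposed (class and method names here are hypothetical, and the byte[] deque stands in for the real cache structure; this is not what was eventually merged):

import java.util.ArrayDeque;

// Hypothetical illustration of the two proposed static methods: one to check
// whether the calling thread has a thread-local buffer cache, and one to
// trigger an incremental clean-up of it.
final class ThreadCacheControl {

    private static final ThreadLocal<ArrayDeque<byte[]>> CACHE = new ThreadLocal<ArrayDeque<byte[]>>() {
        @Override
        protected ArrayDeque<byte[]> initialValue() {
            return new ArrayDeque<byte[]>();
        }
    };

    // Tells whether the current thread currently holds any cached buffers.
    static boolean hasThreadLocalCache() {
        return !CACHE.get().isEmpty();
    }

    // Frees the cached buffers held by the current thread.
    static void trimThreadLocalCache() {
        CACHE.get().clear();
    }
}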

normanmaurer (Member) commented Mar 13, 2014

@trustin I guess this could work... But I think we should just do the per-thread-local-cache free-up after X allocations for now. Sounds good?

trustin (Member) commented Mar 13, 2014

SGTM. Thanks!
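
A minimal sketch of the agreed strategy (count allocations through the thread-local cache and trim it every N allocations); the threshold name follows the system property discussed later in this thread, but the exact implementation in the PR may differ:

// Illustrative only: trim the thread-local cache after every
// `freeSweepAllocationThreshold` allocation attempts.
final class CacheTrimSketch {
    private final int freeSweepAllocationThreshold;
    private int allocations;

    CacheTrimSketch(int freeSweepAllocationThreshold) {
        this.freeSweepAllocationThreshold = freeSweepAllocationThreshold;
    }

    // Called on every allocation attempt, whether or not the cache served it.
    void recordAllocation() {
        if (++allocations >= freeSweepAllocationThreshold) {
            allocations = 0;
            trim();
        }
    }

    private void trim() {
        // Free cached entries that were not used since the last sweep.
    }
}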

executor.terminationFuture().addListener(new FutureListener<Object>() {
@Override
public void operationComplete(Future<Object> future) throws Exception {
cache.free();

normanmaurer (Member), Mar 13, 2014:

@trustin so one question remains: what to do with this one? Just remove it for now?

trustin (Member), Mar 13, 2014:

Let's remove it as well as EventExecutors, etc.

ghost commented Mar 13, 2014

Build result for #2284 at 5e2b9d2: Success

ghost commented Mar 13, 2014

Build result for #2284 at ce0ab32: Success

ghost commented Mar 13, 2014

Build result for #2284 at fd4045c: Success

ghost commented Mar 13, 2014

Build result for #2284 at fc4b15510ba7a6d2bb467fcdf86da1119db75ebc: Success

ghost commented Mar 13, 2014

Build result for #2284 at acfc131: Success

ghost commented Mar 14, 2014

Build result for #2284 at 2cb78ae: Success

ghost commented Mar 14, 2014

Build result for #2284 at bcd9e09: Success

normanmaurer (Member) commented Mar 14, 2014

@trustin ready for final review :)

normanmaurer (Member) commented Mar 17, 2014

@trustin ping... could you please do a final review? Thanks!

        int cacheSize, int maxCachedBufferCapacity, PoolArena<T> area) {
    if (cacheSize > 0) {
        int max = Math.min(area.chunkSize, maxCachedBufferCapacity);
        int arraySize = max / area.pageSize;

trustin (Member), Mar 18, 2014:

I guess we need to ensure this is at least 1, or some other small minimum.

trustin (Member), Mar 20, 2014:

This was not addressed yet.
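
One possible way to address it (my sketch, not necessarily the fix that was committed) is to clamp the computed size:

// Ensure at least one cache slot even when maxCachedBufferCapacity < pageSize.
int arraySize = Math.max(1, max / area.pageSize);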

cache.free();
}

void freeUpIfNecessary() {

trustin (Member), Mar 18, 2014:

How about renaming to something else like trim() or cleanup()? It's different from free() even when it's necessary, so I think it shouldn't be called free..().

int free = size() - maxEntriesInUse;

if (free <= maxUnusedCached) {
    return;

trustin (Member), Mar 18, 2014:

We still have to reset entriesInUse and maxEntriesInUse before returning.
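
In other words, something along these lines (a sketch; the surrounding code in the actual commit may differ):

if (free <= maxUnusedCached) {
    // Reset the usage counters even when nothing is freed, so the next
    // sweep measures a fresh interval.
    entriesInUse = 0;
    maxEntriesInUse = 0;
    return;
}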

logger.debug("-Dio.netty.allocator.smallCacheSize: {}", DEFAULT_SMALL_CACHE_SIZE);
logger.debug("-Dio.netty.allocator.normalCacheSize: {}", DEFAULT_NORMAL_CACHE_SIZE);
logger.debug("-Dio.netty.allocator.maxCachedBufferCapacity: {}", DEFAULT_MAX_CACHED_BUFFER_CAPACITY);
logger.debug("-Dio.netty.allocator.cacheFreeSweepAllocationThreshold: {}",

trustin (Member), Mar 18, 2014:

I don't think a user will understand what this property is for from its name. Do you have a better name? How about: io.netty.allocator.cacheCleanUpInterval or io.netty.allocator.cacheTrimInterval?

normanmaurer (Member) commented Mar 18, 2014

@trustin addressed all your comments... anything else, or can we merge?

ghost commented Mar 18, 2014

Build result for #2284 at 29a613b: Success

trustin (Member) commented Mar 20, 2014

Please merge once you address the last comment. Thanks for your patience.

Norman Maurer
Implement Thread caches for pooled buffers to minimize contention. This fixes [#2264] and [#808].

Motivation:
Remove the synchronization bottleneck in PoolArena and so speed things up.

Modifications:

This implementation uses roughly the same techniques as outlined in the jemalloc paper and the jemalloc
blog post https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

At the moment we only cache for "known" Threads (those that power EventExecutors) and not for others, to keep the
overhead minimal when we need to free up unused buffers in the cache and to free cached buffers once the Thread
completes. Here we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at
all to keep memory usage at a sane level. All the different cache configurations can be adjusted via system properties
or directly via the constructor where it makes sense.

Result:
Less contention, as most allocations can be served by the cache itself.
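
For illustration, adjusting the cache configuration as described above might look like the sketch below. The smallCacheSize, normalCacheSize and maxCachedBufferCapacity property names follow the ones discussed in this PR; the tinyCacheSize property, the constructor signature and the values shown are assumptions, so verify them against the actual 4.0.18 API.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class CacheConfigExample {
    public static void main(String[] args) {
        // Via system properties (read when the allocator class initializes):
        System.setProperty("io.netty.allocator.tinyCacheSize", "512");
        System.setProperty("io.netty.allocator.smallCacheSize", "256");
        System.setProperty("io.netty.allocator.normalCacheSize", "64");
        System.setProperty("io.netty.allocator.maxCachedBufferCapacity", String.valueOf(32 * 1024));

        // Or directly via the constructor: preferDirect, nHeapArena, nDirectArena,
        // pageSize, maxOrder, tinyCacheSize, smallCacheSize, normalCacheSize.
        PooledByteBufAllocator alloc = new PooledByteBufAllocator(
                true, 16, 16, 8192, 11, 512, 256, 64);

        ByteBuf buf = alloc.directBuffer(1024);
        try {
            // ... use the buffer; releasing it may hand it back to the thread cache.
        } finally {
            buf.release();
        }
    }
}
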
normanmaurer (Member) commented Mar 20, 2014

@trustin all addressed and merged into 4.0, 4.1 and master. Thanks for all the comments buddy!

@@ -23,14 +23,16 @@

abstract class PoolArena<T> {

static final int numTinySubpagePools = 512 >>> 4;

viktorklang, Mar 20, 2014:

LOL, something wrong with 32? :)

normanmaurer (Member), Mar 20, 2014:

haha, we love this kind of stuff ;) Actually it was there already, I just moved it... So blame @trustin =P

daschl (Member), Mar 20, 2014:

it's not l33t enough.

ghost commented Mar 20, 2014

Build result for #2284 at d27679f: Success

normanmaurer deleted the pool_caches branch Apr 15, 2014
