Write optimized Queue implementation for NioEventLoop #1259

Closed
normanmaurer opened this Issue Apr 9, 2013 · 16 comments

@laczoka
laczoka commented Apr 9, 2013

+1

I was going to suggest this myself for the buffer handling infrastructure, but I wasn't sure whether Netty 4.0 already had these ideas baked in :)

@hepin1989

I also noticed that you added some padding somewhere in the current implementation, but it is commented out with a note that it still needs testing. The padding may speed up the queue.
http://martinfowler.com/articles/lmax.html
https://code.google.com/p/disruptor/wiki/BlogsAndArticles
http://lmax-exchange.github.io/disruptor/
http://www.research.ibm.com/people/m/michael/podc-1996.pdf
http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular
These links may still be helpful; really looking forward to this.

public long p1, p2, p3, p4, p5, p6, p7; // cache line padding
private volatile long cursor = INITIAL_CURSOR_VALUE;
public long p8, p9, p10, p11, p12, p13, p14; // cache line padding
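
For context, that padding exists to prevent false sharing: the hot cursor field gets a cache line to itself, so producer and consumer threads writing to neighbouring fields do not keep invalidating each other's cache lines. A minimal, illustrative sketch of the pattern (class and field names are invented here, not taken from the Disruptor or Netty):

// Illustrative only: keep the frequently written field on its own cache line.
// Note that the JVM is free to lay fields out as it likes, which is one reason
// later Disruptor versions moved to inheritance-based padding instead.
public class PaddedSequence {
    public long p1, p2, p3, p4, p5, p6, p7; // padding before the hot field
    private volatile long value = -1L;      // the only field written frequently
    public long q1, q2, q3, q4, q5, q6, q7; // padding after the hot field

    public long get() { return value; }
    public void set(long v) { value = v; }
}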

@trustin
Member
trustin commented Apr 10, 2013

Great to see people are interested. Yeah, it's one of the many places we can improve by a large margin. Feel free to jump in! :-)

@Tochemey

I am not so familiar with the Netty core API. I have used it so far in some protocol implementations and I quite like it. Since you are talking about an optimized queue implementation along producer-consumer lines, may I suggest that the Netty development team have a look at the LMAX Disruptor framework, particularly its ring buffer approach, and see whether it can fit into Netty? I think it would be a great plus for the framework.

@hepin1989

Hi, here is a simple throughput test run with the Disruptor:

pekall@pekall:~/disruptor$ ./gradlew perfThroughput
:compileJava UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:compilePerfJava UP-TO-DATE
:processPerfResources UP-TO-DATE
:perfClasses UP-TO-DATE
:perfThroughput

com.lmax.disruptor.ThreePublisherToOneProcessorSequencedThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=13,262,599 ops/sec
    Run 1, Disruptor=13,717,421 ops/sec
    Run 2, Disruptor=12,804,097 ops/sec
    Run 3, Disruptor=12,690,355 ops/sec
    Run 4, Disruptor=13,262,599 ops/sec
    Run 5, Disruptor=12,650,221 ops/sec
    Run 6, Disruptor=13,114,754 ops/sec
    Run 7, Disruptor=13,012,361 ops/sec
    Run 8, Disruptor=12,492,192 ops/sec
    Run 9, Disruptor=12,468,827 ops/sec
    Run 10, Disruptor=12,430,080 ops/sec
    Run 11, Disruptor=11,983,223 ops/sec
    Run 12, Disruptor=12,315,270 ops/sec
    Run 13, Disruptor=12,903,225 ops/sec
    Run 14, Disruptor=12,738,853 ops/sec
    Run 15, Disruptor=11,827,321 ops/sec
    Run 16, Disruptor=11,911,852 ops/sec
    Run 17, Disruptor=12,091,898 ops/sec
    Run 18, Disruptor=12,143,290 ops/sec
    Run 19, Disruptor=12,048,192 ops/sec

com.lmax.disruptor.OnePublisherToOneProcessorRawThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=240,384,615 ops/sec
    Run 1, Disruptor=186,915,887 ops/sec
    Run 2, Disruptor=186,393,289 ops/sec
    Run 3, Disruptor=185,356,811 ops/sec
    Run 4, Disruptor=173,913,043 ops/sec
    Run 5, Disruptor=182,982,616 ops/sec
    Run 6, Disruptor=216,684,723 ops/sec
    Run 7, Disruptor=161,550,888 ops/sec
    Run 8, Disruptor=151,057,401 ops/sec
    Run 9, Disruptor=240,384,615 ops/sec
    Run 10, Disruptor=171,673,819 ops/sec
    Run 11, Disruptor=250,000,000 ops/sec
    Run 12, Disruptor=183,992,640 ops/sec
    Run 13, Disruptor=168,491,996 ops/sec
    Run 14, Disruptor=151,400,454 ops/sec
    Run 15, Disruptor=170,212,765 ops/sec
    Run 16, Disruptor=180,342,651 ops/sec
    Run 17, Disruptor=195,121,951 ops/sec
    Run 18, Disruptor=183,654,729 ops/sec
    Run 19, Disruptor=172,860,847 ops/sec

com.lmax.disruptor.OnePublisherToThreeProcessorDiamondThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=105,485,232 ops/sec
    Run 1, Disruptor=81,967,213 ops/sec
    Run 2, Disruptor=31,436,655 ops/sec
    Run 3, Disruptor=32,488,628 ops/sec
    Run 4, Disruptor=31,496,062 ops/sec
    Run 5, Disruptor=31,605,562 ops/sec
    Run 6, Disruptor=32,981,530 ops/sec
    Run 7, Disruptor=32,883,919 ops/sec
    Run 8, Disruptor=32,573,289 ops/sec
    Run 9, Disruptor=32,776,138 ops/sec
    Run 10, Disruptor=32,123,353 ops/sec
    Run 11, Disruptor=31,836,994 ops/sec
    Run 12, Disruptor=31,655,587 ops/sec
    Run 13, Disruptor=32,786,885 ops/sec
    Run 14, Disruptor=32,927,230 ops/sec
    Run 15, Disruptor=33,500,837 ops/sec
    Run 16, Disruptor=32,102,728 ops/sec
    Run 17, Disruptor=33,036,009 ops/sec
    Run 18, Disruptor=31,959,092 ops/sec
    Run 19, Disruptor=31,857,279 ops/sec

com.lmax.disruptor.OnePublisherToThreeWorkerPoolThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=15,216,068 ops/sec
    Run 1, Disruptor=12,695,188 ops/sec
    Run 2, Disruptor=13,821,700 ops/sec
    Run 3, Disruptor=13,875,398 ops/sec
    Run 4, Disruptor=14,398,848 ops/sec
    Run 5, Disruptor=14,295,925 ops/sec
    Run 6, Disruptor=14,251,104 ops/sec
    Run 7, Disruptor=14,654,161 ops/sec
    Run 8, Disruptor=13,808,340 ops/sec
    Run 9, Disruptor=14,271,442 ops/sec
    Run 10, Disruptor=14,386,419 ops/sec
    Run 11, Disruptor=14,371,945 ops/sec
    Run 12, Disruptor=14,269,406 ops/sec
    Run 13, Disruptor=14,066,676 ops/sec
    Run 14, Disruptor=14,378,145 ops/sec
    Run 15, Disruptor=14,271,442 ops/sec
    Run 16, Disruptor=14,390,559 ops/sec
    Run 17, Disruptor=14,314,342 ops/sec
    Run 18, Disruptor=14,371,945 ops/sec
    Run 19, Disruptor=14,522,218 ops/sec

com.lmax.disruptor.OnePublisherToThreeProcessorMultiCastThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=27,624,309 ops/sec
    Run 1, Disruptor=27,616,680 ops/sec
    Run 2, Disruptor=59,808,612 ops/sec
    Run 3, Disruptor=59,952,038 ops/sec
    Run 4, Disruptor=59,241,706 ops/sec
    Run 5, Disruptor=62,150,403 ops/sec
    Run 6, Disruptor=74,682,598 ops/sec
    Run 7, Disruptor=61,804,697 ops/sec
    Run 8, Disruptor=58,038,305 ops/sec
    Run 9, Disruptor=66,269,052 ops/sec
    Run 10, Disruptor=65,703,022 ops/sec
    Run 11, Disruptor=60,386,473 ops/sec
    Run 12, Disruptor=65,061,808 ops/sec
    Run 13, Disruptor=58,411,214 ops/sec
    Run 14, Disruptor=61,919,504 ops/sec
    Run 15, Disruptor=62,774,639 ops/sec
    Run 16, Disruptor=66,269,052 ops/sec
    Run 17, Disruptor=74,850,299 ops/sec
    Run 18, Disruptor=64,474,532 ops/sec
    Run 19, Disruptor=67,934,782 ops/sec

com.lmax.disruptor.OnePublisherToOneProcessorRawBatchThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=288,808,664 ops/sec
    Run 1, Disruptor=3,246,753,246 ops/sec
    Run 2, Disruptor=3,636,363,636 ops/sec
    Run 3, Disruptor=3,412,969,283 ops/sec
    Run 4, Disruptor=2,427,184,466 ops/sec
    Run 5, Disruptor=3,164,556,962 ops/sec
    Run 6, Disruptor=1,947,419,668 ops/sec
    Run 7, Disruptor=3,284,072,249 ops/sec
    Run 8, Disruptor=1,841,620,626 ops/sec
    Run 9, Disruptor=3,412,969,283 ops/sec
    Run 10, Disruptor=3,649,635,036 ops/sec
    Run 11, Disruptor=3,472,222,222 ops/sec
    Run 12, Disruptor=2,857,142,857 ops/sec
    Run 13, Disruptor=3,642,987,249 ops/sec
    Run 14, Disruptor=3,355,704,697 ops/sec
    Run 15, Disruptor=3,372,681,281 ops/sec
    Run 16, Disruptor=3,367,003,367 ops/sec
    Run 17, Disruptor=1,309,757,694 ops/sec
    Run 18, Disruptor=1,960,784,313 ops/sec
    Run 19, Disruptor=3,669,724,770 ops/sec

com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=54,764,512 ops/sec
    Run 1, Disruptor=51,840,331 ops/sec
    Run 2, Disruptor=87,183,958 ops/sec
    Run 3, Disruptor=66,622,251 ops/sec
    Run 4, Disruptor=50,838,840 ops/sec
    Run 5, Disruptor=50,581,689 ops/sec
    Run 6, Disruptor=68,166,325 ops/sec
    Run 7, Disruptor=68,166,325 ops/sec
    Run 8, Disruptor=51,098,620 ops/sec
    Run 9, Disruptor=71,581,961 ops/sec
    Run 10, Disruptor=71,326,676 ops/sec
    Run 11, Disruptor=68,728,522 ops/sec
    Run 12, Disruptor=51,229,508 ops/sec
    Run 13, Disruptor=50,226,017 ops/sec
    Run 14, Disruptor=49,701,789 ops/sec
    Run 15, Disruptor=71,684,587 ops/sec
    Run 16, Disruptor=68,823,124 ops/sec
    Run 17, Disruptor=51,203,277 ops/sec
    Run 18, Disruptor=51,706,308 ops/sec
    Run 19, Disruptor=50,787,201 ops/sec

com.lmax.disruptor.OnePublisherToThreeProcessorPipelineThroughputTest > shouldCompareDisruptorVsQueues STANDARD_OUT
    Skipping Queue tests
    Starting Disruptor tests
    Run 0, Disruptor=38,971,161 ops/sec
    Run 1, Disruptor=33,456,005 ops/sec
    Run 2, Disruptor=41,169,205 ops/sec
    Run 3, Disruptor=39,510,075 ops/sec
    Run 4, Disruptor=39,184,952 ops/sec
    Run 5, Disruptor=41,152,263 ops/sec
    Run 6, Disruptor=38,240,917 ops/sec
    Run 7, Disruptor=39,184,952 ops/sec
    Run 8, Disruptor=41,271,151 ops/sec
    Run 9, Disruptor=39,525,691 ops/sec
    Run 10, Disruptor=37,579,857 ops/sec
    Run 11, Disruptor=40,716,612 ops/sec
    Run 12, Disruptor=37,893,141 ops/sec
    Run 13, Disruptor=36,941,263 ops/sec
    Run 14, Disruptor=36,310,820 ops/sec
    Run 15, Disruptor=41,118,421 ops/sec
    Run 16, Disruptor=40,273,862 ops/sec
    Run 17, Disruptor=40,933,278 ops/sec
    Run 18, Disruptor=40,700,040 ops/sec
    Run 19, Disruptor=40,749,796 ops/sec


@normanmaurer
Member

A new article related to this that looks promising:

http://psy-lob-saw.blogspot.de/2013/10/lock-free-mpsc-1.html
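
In short, the article describes a lock-free multi-producer single-consumer (MPSC) linked queue: producers atomically swap the tail with an XCHG, and the single consumer walks the list without any CAS. A rough, illustrative sketch of the idea in plain Java follows; this is not Netty's implementation and the names are made up.

import java.util.concurrent.atomic.AtomicReference;

// Illustrative Vyukov-style MPSC linked queue: offer() may be called from any
// thread, poll() only from the single consumer (the event loop thread).
final class MpscLinkedQueueSketch<E> {
    private static final class Node<E> {
        volatile Node<E> next;
        E value;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<E>> tail; // shared by all producers
    private Node<E> head;                        // owned by the consumer only

    MpscLinkedQueueSketch() {
        Node<E> stub = new Node<E>(null);
        head = stub;
        tail = new AtomicReference<Node<E>>(stub);
    }

    boolean offer(E e) {
        Node<E> node = new Node<E>(e);
        Node<E> prev = tail.getAndSet(node); // wait-free swap of the tail
        prev.next = node;                    // publish the link to the consumer
        return true;
    }

    E poll() {
        Node<E> next = head.next;
        if (next == null) {
            return null; // empty, or a producer swapped the tail but has not linked yet
        }
        E value = next.value;
        next.value = null; // the consumed node becomes the new stub
        head = next;
        return value;
    }
}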

@CodeMason

Have you thought about using the Disruptor? The performance benchmarks look really amazing.

@daschl
Member
daschl commented Jan 31, 2014

Hey folks.. for fun I started playing around with it a bit..

To start off (and yes, I know this is not the best way to utilize it by any means), I wrapped the Disruptor in a simple queue interface and extended the existing event loop code, roughly along the lines of the sketch at the end of this comment. I'm sure there is a better way to get the lowest sequence with barriers or so; this was just a simple approach that came to mind.

See: daschl@7138002

I ran a wrk benchmark against the helloworld example and while the throughput was the same, I saw less CPU load (11% instead of 30%) and less kernel CPU, indicating fewer context switches.

To really get the best out of it we'd need to rearchitect it all so that both batch add and batch poll are possible. Only then do we get the real performance benefit. Note that we are I/O bound anyway, so it will - IMHO - only show when you have LOTS of events, and otherwise mostly as lower resource usage in general.

I'll put more work into this soon, but if you're curious play around with it.
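
Very roughly, the shape of such a wrapper looks like the sketch below. To be clear, this is not the code in daschl@7138002; the class and event names are invented, error handling is omitted, and the non-blocking poll is only an approximation.

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.Sequence;
import com.lmax.disruptor.SequenceBarrier;

// Illustrative only: the Disruptor RingBuffer hidden behind a tiny offer/poll
// facade so it can stand in for the event loop's task queue.
final class DisruptorTaskQueue {
    static final class RunnableEvent {
        Runnable task;
        static final EventFactory<RunnableEvent> FACTORY = new EventFactory<RunnableEvent>() {
            public RunnableEvent newInstance() { return new RunnableEvent(); }
        };
    }

    private final RingBuffer<RunnableEvent> ring =
            RingBuffer.createMultiProducer(RunnableEvent.FACTORY, 64 * 1024);
    private final SequenceBarrier barrier = ring.newBarrier();
    private final Sequence consumed = new Sequence(-1L);

    DisruptorTaskQueue() {
        ring.addGatingSequences(consumed); // producers must not lap the consumer
    }

    // Any thread may publish.
    void offer(Runnable task) {
        long seq = ring.next();            // claims a slot, may spin if the ring is full
        try {
            ring.get(seq).task = task;
        } finally {
            ring.publish(seq);
        }
    }

    // Single consumer (the event loop thread) only.
    Runnable poll() throws Exception {     // waitFor may throw Alert/Interrupted/Timeout
        long next = consumed.get() + 1;
        if (next > ring.getCursor()) {
            return null;                   // nothing claimed yet
        }
        long available = barrier.waitFor(next);
        if (available < next) {
            return null;                   // claimed but not yet published
        }
        Runnable task = ring.get(next).task;
        consumed.set(next);
        return task;
    }
}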

@normanmaurer
Member

With the same throughput?

@daschl
Member
daschl commented Jan 31, 2014

Yeah, but I just checked and it was a flaw: when I run it with only 1 thread in the event loop it's the same, so I didn't push it hard enough yet. I'll run some larger numbers soon, but if you could try it against your canonical HTTP ultimate perf test, that would be fun!

@normanmaurer
Member

I will run it in the lab and let you know

@daschl
Member
daschl commented Jan 31, 2014

Note there is a high chance it's pretty bad ;) A real implementation needs to be very different from the current event loop, with batching and notifications and all that. It's fun nevertheless, hehe.

@daschl
Member
daschl commented Jan 31, 2014

Also make sure to run the workers with 1 thread and not N as in the eventloop!

@normanmaurer
Member

@daschl Also, I think the implementation would only really make a big difference if we trigger things like writes from outside the EventLoop. Otherwise the queue is not really used that often; see the small illustration below.
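
To make that concrete: the task queue only comes into play when work is submitted to the event loop from another thread. A small illustration using the public Netty API (the helper class itself is hypothetical):

import io.netty.channel.Channel;

// Illustrative helper: when the caller is already on the channel's event loop the
// work runs inline; otherwise it is handed over via execute(), which puts a
// Runnable on the event loop's task queue, i.e. the queue this issue is about.
// This mirrors what Netty itself does internally for writes issued from other threads.
public final class EventLoopSubmit {
    public static void runOnEventLoop(final Channel ch, final Runnable work) {
        if (ch.eventLoop().inEventLoop()) {
            work.run();                   // no queue involved
        } else {
            ch.eventLoop().execute(work); // goes through the event loop's task queue
        }
    }
}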

@normanmaurer added a commit that referenced this issue Feb 26, 2014:
[#1259] Add optimized queue for SCMP pattern and use it in NIO and native transport

This queue also produces less GC then CLQ when make use of OneTimeTask
b8e5eb7

@normanmaurer added three more commits that referenced this issue Feb 27, 2014, all with the same message:
bdedde1, d3ffa1b, 6efac61
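
For context on the GC remark in that commit message: ConcurrentLinkedQueue allocates an internal node object for every offer(), whereas an intrusive MPSC queue can use the submitted task itself as the node. An illustrative sketch of that idea (names invented, not Netty's actual classes):

// Illustrative only: the queue link lives inside the task object, so enqueueing
// a one-shot task costs no extra allocation per offer, which is where the GC
// saving mentioned in the commit message comes from.
public final class IntrusiveTasks {

    public abstract static class NodeTask implements Runnable {
        volatile NodeTask next; // link used by an intrusive MPSC queue
    }

    // A one-shot task is written as a subclass; it can then be linked into an
    // intrusive variant of the MPSC queue sketched earlier in this thread,
    // where the task replaces the internal Node wrapper.
    public static NodeTask newExampleTask() {
        return new NodeTask() {
            @Override
            public void run() {
                System.out.println("run once"); // stand-in for the real work
            }
        };
    }
}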
@plokhotnyuk

Here is an optimized version of the Scalaz actor that encouraged @viktorklang to implement a fast MPSC mailbox one year ago:
https://github.com/plokhotnyuk/actors/blob/c9f850615484c171ebfe43adf1fcf5422a9fff61/src/test/scala/com/github/plokhotnyuk/actors/Actor2.scala

It can enqueue/dequeue with a throughput of 68M/426M msg/sec respectively when run by a single thread:
https://github.com/plokhotnyuk/actors/blob/c9f850615484c171ebfe43adf1fcf5422a9fff61/out2_poolSize1.txt#L263
