Lock contention in QueueImpl #837

Closed
wants to merge 1 commit into from

Conversation

@chamacie
Contributor

chamacie commented Feb 7, 2013

This is the follow up on #831 .

I couldn't find a way to reopen the old PR, so a new one.
And I got a bit lost with Git. I'm learning that on the fly here, so I hope everything went fine. Although I'm not 100% sure.

In any case, I ran the test suite as specified with this change set and they all passed.

@HornetQBot

Can one of the admins verify this patch?

@clebertsuconic
Member

ok to test

@clebertsuconic
Member

I have imported your commit into a branch on my fork... master-chamacie... I'm running a couple of tests on this before we merge it...

I want to hear from Francisco and Andy on this one as well (after I finish the tests)

@ghost assigned clebertsuconic Feb 8, 2013
@clebertsuconic
Member

Sorry for the internal link.. but just as a utility for us, this is the link to the full testsuite I'm running now:

http://messaging-09.jbm.lab.bos.redhat.com:8080/job/hornetq-param/192/

I will come back to you guys on this tomorrow after the tests have finished, while I'm in sleep mode.

@chamacie
Contributor Author

chamacie commented Feb 8, 2013

No problem.

btw:
I cleaned up a stack dump where you can see one of the problems I'm trying to solve. Only the HornetQ threads are listed. You can see that Thread-24 holds the lock and every Old I/O Server worker is blocked waiting for that lock.
https://docs.google.com/file/d/0B0t767l6p9pbd2ZtVjRReUJuNnM/edit

The other issue is the contention on ServerMessageImpl.
I won't upload the results from the profiled runs, but ServerMessageImpl.incrementRefCount and decrementRefCount popped up as seriously contended code blocks.
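
A minimal sketch of the kind of lock-free counter that avoids this sort of contention, assuming the counters are currently guarded by a lock (illustrative only, not the actual ServerMessageImpl code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative lock-free reference counter: increments and decrements go
// through atomic operations instead of a shared monitor, so concurrent
// callers never block each other.
public class RefCounter
{
   private final AtomicInteger refCount = new AtomicInteger(0);

   public int incrementRefCount()
   {
      return refCount.incrementAndGet();
   }

   public int decrementRefCount()
   {
      int count = refCount.decrementAndGet();
      if (count == 0)
      {
         // last reference released; this is where a message could free
         // its associated resources
      }
      return count;
   }
}
```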

@clebertsuconic
Member

I'm not sure what kind of tests you are doing... but we haven't seen these issues in most of the benchmarks.

We certainly need locks in certain places; they are part of the logic for distributing messages. But we will certainly improve things whenever we find better ways.

But this code itself is not valid; something is wrong here:

07:18:07,134 ERROR [org.hornetq.core.server] HQ224016: Caught exception: java.lang.UnsupportedOperationException
at java.util.concurrent.CopyOnWriteArrayList$COWIterator.remove(CopyOnWriteArrayList.java:1004) [rt.jar:1.6.0_27]
at org.hornetq.core.server.impl.QueueImpl.cancelRedistributor(QueueImpl.java:665) [:]
at org.hornetq.core.server.impl.QueueImpl.addConsumer(QueueImpl.java:572) [:]
at org.hornetq.core.server.impl.ServerConsumerImpl.<init>(ServerConsumerImpl.java:179) [:]
at org.hornetq.core.server.impl.ServerSessionImpl.createConsumer(ServerSessionImpl.java:345) [:]
at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:213) [:]
at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:626) [:]
at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:547) [:]
at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:523) [:]
at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:562) [:]
at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:72) [:]
at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:281) [netty-3.6.2.Final.jar:]
at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(HornetQFrameDecoder2.java:169) [:]
at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:134) [:]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.socket.oio.OioWorker.process(OioWorker.java:71) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.socket.oio.AbstractOioWorker.run(AbstractOioWorker.java:73) [netty-3.6.2.Final.jar:]
at org.jboss.netty.channel.socket.oio.OioWorker.run(OioWorker.java:51) [netty-3.6.2.Final.jar:]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.6.2.Final.jar:]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.6.2.Final.jar:]
at org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:175) [netty-3.6.2.Final.jar:]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [rt.jar:1.6.0_27]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [rt.jar:1.6.0_27]
at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_27]
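
For reference, iterators over a CopyOnWriteArrayList operate on an immutable snapshot and do not support remove(), which is exactly what the trace shows; removal has to go through the list itself. A minimal standalone illustration (hypothetical names, not the actual QueueImpl code):

```java
import java.util.Iterator;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowRemoveExample
{
   public static void main(String[] args)
   {
      CopyOnWriteArrayList<String> consumers = new CopyOnWriteArrayList<String>();
      consumers.add("redistributor");
      consumers.add("consumer-1");

      Iterator<String> iter = consumers.iterator();
      while (iter.hasNext())
      {
         String consumer = iter.next();
         if ("redistributor".equals(consumer))
         {
            // iter.remove() would throw UnsupportedOperationException here,
            // because the iterator only sees an immutable snapshot.
            consumers.remove(consumer); // remove through the list instead
         }
      }
   }
}
```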

@clebertsuconic
Member

You should open a thread on the Developer's forum here.. you should start by sharing your tests:

https://community.jboss.org/en/hornetq/dev?view=discussions

I will close this as it's not possible to merge it at this point. We should have a dev discussion.

@chamacie
Contributor Author

chamacie commented Feb 8, 2013

Ah, that is the same error as the previous one, just at a different line of code. Apparently the quick tests don't cover cancelling a redistributor. Can you tell me which test suite or test is failing? Then I can verify a fix.

Getting back to your request to share the tests: that is impossible, as they contain proprietary code. I can tell you what we are testing, though.
The scenario is putting the server under load with a couple of hundred consumers and producers. You'll have to forgive me, but I really can't talk about numbers here. Then we simulate one consumer host machine crashing horribly. What we are tuning for is that this has as little effect as possible on the delivery time of messages going to the other consumers. I.e. one slow/crashing consumer should not make the other ones slow as well.
As you can see, the backlog of messages that still need to be sent to the failing consumer is nicely being cleaned up. However, new messages cannot be delivered to their Queues, because the Queue for the failing consumer is completely blocked (for some time, of course) and cannot answer whether a new message is applicable to it or not. This prevents the message from moving forward in the broker.
Zooming out to the entire client-server landscape, what we see is that a single crashing consumer has a knock-on effect on the latency of the entire system. This changeset addresses something that popped up which can be improved on the HornetQ side in that respect.
To give some background, we're measuring delivery times in sub-milliseconds, so we are more interested in latency than throughput.

I'll make a topic tomorrow to discuss this.

@clebertsuconic
Member

You could easily fix the crashing case by disabling direct delivery.

Ultimately you could use NIO, which takes a non-blocking approach.

Those two options seem to be the proper fix for your case.
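
A minimal sketch of how those two options could look when building the acceptor programmatically, assuming the documented Netty transport keys use-nio and direct-deliver (check the HornetQ configuration reference for your version; the values here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

import org.hornetq.api.core.TransportConfiguration;

public class AcceptorConfigSketch
{
   public static TransportConfiguration nioAcceptorWithoutDirectDeliver()
   {
      Map<String, Object> params = new HashMap<String, Object>();
      params.put("use-nio", true);         // non-blocking I/O instead of the old blocking transport
      params.put("direct-deliver", false); // hand delivery off to an executor instead of the Netty thread

      return new TransportConfiguration(
         "org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory", params);
   }
}
```

The same two keys can equally be set as param entries on the acceptor in hornetq-configuration.xml.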

Regarding the test, you could write a single test to emulate what you are seeing. Usually when you raise an issue you have to provide a test.

@chamacie
Contributor Author

chamacie commented Feb 8, 2013

We moved to NIO. In our setup non-blocking was not performant enough.
Yet the problem remains that one slow consumer makes the others slow as well. At least we could prevent it from taking HornetQ down with an OOM as well, so that is good.

About the test: I wouldn't know how to write a 'small' failing integration test that could run in Maven. We can currently only reproduce this issue in our load-test environment. I'm open to suggestions though :) I would need to run two separate consumers, one producer, and a server. Then only one of the consumers should be delayed somehow, for instance by adding an incremental delay in Netty. Is there a similar test somewhere in HornetQ, maybe?
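
A rough sketch of the shape such a test could take, assuming a plain JMS connection factory pointed at the broker; the sleeping listener stands in for the delayed consumer, and all names and timings are illustrative:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class SlowConsumerSketch
{
   public static void run(ConnectionFactory cf, Queue queue) throws Exception
   {
      Connection connection = cf.createConnection();

      // "Slow" consumer: an ever-increasing delay stands in for a stalling host.
      Session slowSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      slowSession.createConsumer(queue).setMessageListener(new MessageListener()
      {
         private long delay = 0;

         public void onMessage(Message message)
         {
            try
            {
               Thread.sleep(delay += 10);
            }
            catch (InterruptedException ignored)
            {
            }
         }
      });

      // "Fast" consumer: record how long each message took to arrive.
      Session fastSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      fastSession.createConsumer(queue).setMessageListener(new MessageListener()
      {
         public void onMessage(Message message)
         {
            try
            {
               long latencyMs = System.currentTimeMillis() - message.getJMSTimestamp();
               System.out.println("latency ms: " + latencyMs);
            }
            catch (Exception e)
            {
               e.printStackTrace();
            }
         }
      });

      connection.start();

      // Keep the queue under load; a real test would assert that the fast
      // consumer's latency stays bounded while the slow one falls behind.
      Session producerSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      MessageProducer producer = producerSession.createProducer(queue);
      for (int i = 0; i < 10000; i++)
      {
         producer.send(producerSession.createTextMessage("m" + i));
      }

      connection.close();
   }
}
```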

Getting back to which test fails: I still need to know which regression test is failing. Making a new failing test and fixing it will not ensure that the regression test that is currently failing will pass.

@clebertsuconic
Member

It's a general failure on the main testsuite.. I think this is in the README, but you can run the entire testsuite by doing this:

mvn -Phudson-tests compile findbugs:check test

@clebertsuconic
Member

Regarding the performance test.. you could write an isolated test for this performance issue you're seeing.

@chamacie
Contributor Author

chamacie commented Feb 8, 2013

Getting the problem to pop up in isolation is one of the problems, but I'll see what I can do.

@clebertsuconic
Member

You're having multiple producers and multiple consumers, with some failure scenario.

I'm not really sure what's going on there.

How many messages per second are you getting? What transaction mode are you using? Persistent or non-persistent?


@clebertsuconic
Member

That's why I said we should move to the dev forum... this discussion will be lost on this PR.. I don't think it's very accessible for future reference.

@chamacie
Contributor Author

chamacie commented Feb 8, 2013

Agreed. I'll open one later today or tomorrow.

@chamacie
Contributor Author

chamacie commented Feb 9, 2013

https://community.jboss.org/thread/221122
Shall I create a new PR or do you want to discuss things first?

@clebertsuconic
Member

Let's discuss the changes you want to make first.. then we can get to a PR once we've agreed.

I could even download your branch and run tests.. let's keep the discussion on the forum until this is polished.
