Under very heavy load, Portico can crash because JGroups refuses to send any more messages. This happens when its sender thread pool fills up completely. The rejection policy for queued tasks appears to be the default abort policy (throw an exception), when it should be a caller-runs policy so the task simply executes on the thread submitting the job.
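The difference between the two policies can be shown with a small standalone sketch (not Portico code): a tiny `ThreadPoolExecutor` with a bounded queue throws `RejectedExecutionException` under the default `AbortPolicy`, while `CallerRunsPolicy` makes the submitting thread run the overflow task itself, throttling the producer instead of crashing.

```java
import java.util.concurrent.*;

public class RejectionDemo {
    public static void main(String[] args) throws Exception {
        Runnable slow = () -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
        };

        // Default AbortPolicy: one worker, one queue slot, third task is rejected.
        ThreadPoolExecutor abortPool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        abortPool.execute(slow); // occupies the single worker
        abortPool.execute(slow); // fills the queue
        boolean rejected = false;
        try {
            abortPool.execute(slow); // no capacity left
        } catch (RejectedExecutionException e) {
            rejected = true; // this is what crashes the federate under load
        }
        System.out.println("abort policy rejected: " + rejected);
        abortPool.shutdownNow();

        // CallerRunsPolicy: the overflow task runs on the submitting thread instead.
        ThreadPoolExecutor callerRuns = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        callerRuns.execute(slow);
        callerRuns.execute(slow);
        // typically runs on the submitting thread (e.g. "main"), acting as backpressure
        callerRuns.execute(() ->
                System.out.println("overflow task ran on: " + Thread.currentThread().getName()));
        callerRuns.shutdown();
        callerRuns.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```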
The following stack trace shows the problem when running the wantest federate (immediate callback mode, a two-federate federation running on the same computer, very fast send speeds).
ERROR [main] portico.lrc: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
Exception in thread "main" hla.rti1516e.exceptions.RTIinternalError: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
at org.portico.impl.hla1516e.Rti1516eAmbassador.updateAttributeValues(Rti1516eAmbassador.java:2081)
at wantest.throughput.ThroughputDriver.loop(ThroughputDriver.java:252)
at wantest.throughput.ThroughputDriver.execute(ThroughputDriver.java:122)
at wantest.Federate.execute(Federate.java:111)
at wantest.Main.main(Main.java:49)
Caused by: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
at org.portico.bindings.jgroups.channel.FederationChannel.send(FederationChannel.java:252)
at org.portico.bindings.jgroups.JGroupsConnection.broadcast(JGroupsConnection.java:153)
at org.portico.lrc.services.object.handlers.outgoing.UpdateAttributesHandler.process(UpdateAttributesHandler.java:105)
at org.portico.utils.messaging.MessageSink.process(MessageSink.java:187)
at org.portico.impl.hla1516e.Impl1516eHelper.processMessage(Impl1516eHelper.java:99)
at org.portico.impl.hla1516e.Rti1516eAmbassador.processMessage(Rti1516eAmbassador.java:5554)
at org.portico.impl.hla1516e.Rti1516eAmbassador.updateAttributeValues(Rti1516eAmbassador.java:2063)
... 4 more
Caused by: java.util.concurrent.RejectedExecutionException: Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.jgroups.protocols.TP.down(TP.java:1209)
at org.jgroups.protocols.Discovery.down(Discovery.java:576)
at org.jgroups.protocols.FD_ALL.down(FD_ALL.java:201)
at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:80)
at org.jgroups.protocols.BARRIER.down(BARRIER.java:94)
at org.jgroups.protocols.pbcast.NAKACK2.send(NAKACK2.java:673)
at org.jgroups.protocols.pbcast.NAKACK2.down(NAKACK2.java:453)
at org.jgroups.protocols.UNICAST2.down(UNICAST2.java:523)
at org.jgroups.protocols.RSVP.down(RSVP.java:143)
at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:328)
at org.jgroups.protocols.pbcast.GMS.down(GMS.java:965)
at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:116)
at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:238)
at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:312)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
at org.jgroups.JChannel.down(JChannel.java:729)
at org.jgroups.JChannel.send(JChannel.java:445)
at org.portico.bindings.jgroups.channel.FederationChannel.send(FederationChannel.java:247)
Environment and Logs
- Portico 2.1.0-beta
- Any
- Tim's big iMac
- wantest federate 1.0.0-beta
Reproduction Steps
Start two test federates and run their throughput tests
```
./wantest.sh --federate-name one --peers two --no-latency-test --packet-size 1K --loops 10000
```
Wait for the exception to appear in one of the federates mid-test. It may not occur on every run.
TBC
This looks to be because I am a moron and I replace the executor in code (FederationChannel:207). I set up a new executor, which gets the default rejection policy (Abort), and it replaces the existing one, which was appropriately configured from the JGroups stack config file.
In my defence, I do this because I want the threads to be daemons, which in turn comes from a problem with the HLA API: you can't guarantee that you know when you are shutting down. In HLA v1.3 there is no disconnect call from which everything can be shut down, and you can't do it on resign (as you still need to process messages, like destroy or create). I need to do a better job of copying over settings when creating this new executor.
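A hypothetical sketch of what "copying over settings" could look like (this is not the actual Portico code, and `copyWithDaemonThreads` is an invented name): build the replacement pool from the old pool's sizing, swap in a daemon `ThreadFactory`, and use `CallerRunsPolicy` instead of the default abort behaviour. Note that the queue type and capacity are not recoverable from an existing `ThreadPoolExecutor`, so that part would still have to come from the JGroups stack config.

```java
import java.util.concurrent.*;

public class DaemonExecutorFactory {
    /**
     * Builds a new executor that mirrors the old one's core/max sizes and
     * keep-alive, but uses daemon threads (so the JVM can exit without an
     * explicit shutdown) and runs overflow tasks on the submitting thread.
     */
    public static ThreadPoolExecutor copyWithDaemonThreads(ThreadPoolExecutor old) {
        ThreadFactory daemonFactory = r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // daemon: won't hold the JVM open at shutdown
            return t;
        };
        return new ThreadPoolExecutor(
                old.getCorePoolSize(),
                old.getMaximumPoolSize(),
                old.getKeepAliveTime(TimeUnit.MILLISECONDS),
                TimeUnit.MILLISECONDS,
                // queue config is NOT copyable from the old pool; assumed unbounded here
                new LinkedBlockingQueue<>(),
                daemonFactory,
                // caller-runs instead of abort: backpressure, not RejectedExecutionException
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```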