
JGroups rejects requests to send messages when it is overloaded #63

Closed
timpokorny opened this issue Jan 12, 2015 · 1 comment

@timpokorny
Member

Summary

Under very heavy load, Portico can crash because JGroups refuses to send any more messages. This happens when its sender thread pool fills up completely. The rejection policy for submitted tasks appears to be the default "abort" (or similar), when it should be "run" so that a rejected task is simply executed on the thread that submitted it.
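The difference between the two policies can be demonstrated with a small standalone sketch (not Portico/JGroups code, just `java.util.concurrent` directly): a saturated pool with the default `AbortPolicy` throws `RejectedExecutionException`, while `CallerRunsPolicy` makes the submitting thread run the task itself, throttling the producer instead of failing.

```java
import java.util.concurrent.*;

public class RejectionDemo {
    public static void main(String[] args) throws Exception {
        Runnable sleep = () -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
        };

        // One thread, a one-slot queue, default AbortPolicy:
        // the third submission has nowhere to go and is rejected.
        ThreadPoolExecutor abort = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        abort.execute(sleep);   // runs on the pool thread
        abort.execute(sleep);   // sits in the queue
        boolean rejected = false;
        try {
            abort.execute(sleep);  // pool busy, queue full -> rejected
        } catch (RejectedExecutionException e) {
            rejected = true;
        }
        abort.shutdown();
        System.out.println("abort policy rejected: " + rejected);

        // Same saturation with CallerRunsPolicy: the rejected task is
        // executed directly on the submitting (main) thread instead.
        ThreadPoolExecutor callerRuns = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        callerRuns.execute(sleep);
        callerRuns.execute(sleep);
        callerRuns.execute(() -> System.out.println(
                "ran on: " + Thread.currentThread().getName()));
        callerRuns.shutdown();
        callerRuns.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```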

The following stack trace shows the problem when running the wantest federate (immediate callback mode, two-federate federation running on the same computer, very fast send rates).

ERROR [main] portico.lrc: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
Exception in thread "main" hla.rti1516e.exceptions.RTIinternalError: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
    at org.portico.impl.hla1516e.Rti1516eAmbassador.updateAttributeValues(Rti1516eAmbassador.java:2081)
    at wantest.throughput.ThroughputDriver.loop(ThroughputDriver.java:252)
    at wantest.throughput.ThroughputDriver.execute(ThroughputDriver.java:122)
    at wantest.Federate.execute(Federate.java:111)
    at wantest.Main.main(Main.java:49)
Caused by: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
    at org.portico.bindings.jgroups.channel.FederationChannel.send(FederationChannel.java:252)
    at org.portico.bindings.jgroups.JGroupsConnection.broadcast(JGroupsConnection.java:153)
    at org.portico.lrc.services.object.handlers.outgoing.UpdateAttributesHandler.process(UpdateAttributesHandler.java:105)
    at org.portico.utils.messaging.MessageSink.process(MessageSink.java:187)
    at org.portico.impl.hla1516e.Impl1516eHelper.processMessage(Impl1516eHelper.java:99)
    at org.portico.impl.hla1516e.Rti1516eAmbassador.processMessage(Rti1516eAmbassador.java:5554)
    at org.portico.impl.hla1516e.Rti1516eAmbassador.updateAttributeValues(Rti1516eAmbassador.java:2063)
    ... 4 more
Caused by: java.util.concurrent.RejectedExecutionException: Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at org.jgroups.protocols.TP.down(TP.java:1209)
    at org.jgroups.protocols.Discovery.down(Discovery.java:576)
    at org.jgroups.protocols.FD_ALL.down(FD_ALL.java:201)
    at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:80)
    at org.jgroups.protocols.BARRIER.down(BARRIER.java:94)
    at org.jgroups.protocols.pbcast.NAKACK2.send(NAKACK2.java:673)
    at org.jgroups.protocols.pbcast.NAKACK2.down(NAKACK2.java:453)
    at org.jgroups.protocols.UNICAST2.down(UNICAST2.java:523)
    at org.jgroups.protocols.RSVP.down(RSVP.java:143)
    at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:328)
    at org.jgroups.protocols.pbcast.GMS.down(GMS.java:965)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
    at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:116)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
    at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
    at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:238)
    at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:312)
    at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
    at org.jgroups.JChannel.down(JChannel.java:729)
    at org.jgroups.JChannel.send(JChannel.java:445)
    at org.portico.bindings.jgroups.channel.FederationChannel.send(FederationChannel.java:247)

Environment and Logs

  • Portico 2.1.0-beta
  • Any
  • Tim's big iMac
  • wantest federate 1.0.0-beta

Reproduction Steps

Start two test federates and run their throughput tests

`./wantest.sh --federate-name one --peers two --no-latency-test --packet-size 1K --loops 10000`

Wait for the exception to appear in one federate mid-test. The failure is intermittent and may take a few runs to reproduce.

TBC

@timpokorny
Member Author

This looks to be because I am a moron: I replace the executor in code (FederationChannel:207). I set up a new executor, which gets the default rejection policy (Abort), and it replaces the existing one, which was appropriately configured from the JGroups stack configuration file.

In my defence, I do this because I want the threads to be daemons, which in turn stems from a problem with the HLA API: you can't guarantee that you know when you are shutting down. In HLA v1.3 there is no disconnect call from which everything could be shut down, and you can't do it on resign, as you still need to process subsequent messages (like destroy or create). I need to do a better job of copying over settings when creating this new executor.
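A minimal sketch of that "copy the settings over" idea, assuming the original executor is being discarded and replaced (so reusing its queue is safe). The class and method names here are hypothetical, not Portico code: the point is to carry over pool sizes, keep-alive, queue, and crucially the rejection handler, while swapping in a daemon `ThreadFactory`.

```java
import java.util.concurrent.*;

// Hypothetical helper: build a daemon-thread executor that mirrors the
// configuration of an existing one, rather than silently falling back
// to ThreadPoolExecutor defaults (which include AbortPolicy).
public class DaemonExecutors {
    public static ThreadPoolExecutor daemonCopy(ThreadPoolExecutor original) {
        ThreadFactory daemonFactory = runnable -> {
            Thread t = new Thread(runnable);
            t.setDaemon(true);  // the whole reason for replacing the executor
            return t;
        };
        return new ThreadPoolExecutor(
                original.getCorePoolSize(),
                original.getMaximumPoolSize(),
                original.getKeepAliveTime(TimeUnit.MILLISECONDS),
                TimeUnit.MILLISECONDS,
                original.getQueue(),            // keep the configured queue type/capacity
                daemonFactory,
                original.getRejectedExecutionHandler());  // keep "run", not "abort"
    }
}
```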
