Under very heavy load, Portico can crash because JGroups refuses to send any more messages. This happens when its sender thread pool fills up completely. The rejection policy for queued tasks appears to be the default abort policy (throw an exception), when it should be a caller-runs policy so the task simply executes on the thread submitting the job.
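The difference between the two policies can be shown with a small standalone sketch (not Portico code): a tiny `ThreadPoolExecutor` with a bounded queue throws `RejectedExecutionException` under the default `AbortPolicy`, while `CallerRunsPolicy` makes the submitting thread run the overflow task itself, throttling the producer instead of crashing.

```java
import java.util.concurrent.*;

public class RejectionDemo {
    public static void main(String[] args) throws Exception {
        Runnable slow = () -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
        };

        // Default AbortPolicy: one worker, one queue slot, third task is rejected.
        ThreadPoolExecutor abortPool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        abortPool.execute(slow); // occupies the single worker
        abortPool.execute(slow); // fills the queue
        boolean rejected = false;
        try {
            abortPool.execute(slow); // no capacity left
        } catch (RejectedExecutionException e) {
            rejected = true; // this is what crashes the federate under load
        }
        System.out.println("abort policy rejected: " + rejected);
        abortPool.shutdownNow();

        // CallerRunsPolicy: the overflow task runs on the submitting thread instead.
        ThreadPoolExecutor callerRuns = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        callerRuns.execute(slow);
        callerRuns.execute(slow);
        // typically runs on the submitting thread (e.g. "main"), acting as backpressure
        callerRuns.execute(() ->
                System.out.println("overflow task ran on: " + Thread.currentThread().getName()));
        callerRuns.shutdown();
        callerRuns.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```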
The following stack trace shows the problem when running the wantest federate (immediate callback mode, a two-federate federation running on the same computer, very fast send speeds).
ERROR [main] portico.lrc: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
Exception in thread "main" hla.rti1516e.exceptions.RTIinternalError: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
at org.portico.impl.hla1516e.Rti1516eAmbassador.updateAttributeValues(Rti1516eAmbassador.java:2081)
at wantest.throughput.ThroughputDriver.loop(ThroughputDriver.java:252)
at wantest.throughput.ThroughputDriver.execute(ThroughputDriver.java:122)
at wantest.Federate.execute(Federate.java:111)
at wantest.Main.main(Main.java:49)
Caused by: org.portico.lrc.compat.JRTIinternalError: Problem sending message: channel=WAN Test Federation, error message=Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
at org.portico.bindings.jgroups.channel.FederationChannel.send(FederationChannel.java:252)
at org.portico.bindings.jgroups.JGroupsConnection.broadcast(JGroupsConnection.java:153)
at org.portico.lrc.services.object.handlers.outgoing.UpdateAttributesHandler.process(UpdateAttributesHandler.java:105)
at org.portico.utils.messaging.MessageSink.process(MessageSink.java:187)
at org.portico.impl.hla1516e.Impl1516eHelper.processMessage(Impl1516eHelper.java:99)
at org.portico.impl.hla1516e.Rti1516eAmbassador.processMessage(Rti1516eAmbassador.java:5554)
at org.portico.impl.hla1516e.Rti1516eAmbassador.updateAttributeValues(Rti1516eAmbassador.java:2063)
... 4 more
Caused by: java.util.concurrent.RejectedExecutionException: Task org.jgroups.protocols.TP$5@7fd751de rejected from java.util.concurrent.ThreadPoolExecutor@e162a35[Running, pool size = 10, active threads = 7, queued tasks = 995, completed tasks = 193678]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.jgroups.protocols.TP.down(TP.java:1209)
at org.jgroups.protocols.Discovery.down(Discovery.java:576)
at org.jgroups.protocols.FD_ALL.down(FD_ALL.java:201)
at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:80)
at org.jgroups.protocols.BARRIER.down(BARRIER.java:94)
at org.jgroups.protocols.pbcast.NAKACK2.send(NAKACK2.java:673)
at org.jgroups.protocols.pbcast.NAKACK2.down(NAKACK2.java:453)
at org.jgroups.protocols.UNICAST2.down(UNICAST2.java:523)
at org.jgroups.protocols.RSVP.down(RSVP.java:143)
at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:328)
at org.jgroups.protocols.pbcast.GMS.down(GMS.java:965)
at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:116)
at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:238)
at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:312)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
at org.jgroups.JChannel.down(JChannel.java:729)
at org.jgroups.JChannel.send(JChannel.java:445)
at org.portico.bindings.jgroups.channel.FederationChannel.send(FederationChannel.java:247)
Environment and Logs
- Portico 2.1.0-beta
- Any
- Tim's big iMac
- wantest federate 1.0.0-beta
Reproduction Steps
Start two test federates and run their throughput tests
```
./wantest.sh --federate-name one --peers two --no-latency-test --packet-size 1K --loops 10000
```
Wait for the exception to appear in one of the federates mid-test. It may not occur on every run.
TBC
This looks to be because I am a moron and I replace the executor in code (FederationChannel:207). I set up a new executor, which gets the default rejection policy (Abort), and it replaces the existing one, which was appropriately configured from the JGroups stack config file.
In my defence, I do this because I want the threads to be daemons, which in turn comes from a problem with the HLA API: you can't guarantee that you know when you are shutting down. In HLA v1.3 there is no disconnect call from which everything can be shut down, and you can't do it on resign (as you still need to process messages, like destroy or create). I need to do a better job of copying over settings when creating this new executor.
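A hypothetical sketch of what "copying over settings" could look like (this is not the actual Portico code, and `copyWithDaemonThreads` is an invented name): build the replacement pool from the old pool's sizing, swap in a daemon `ThreadFactory`, and use `CallerRunsPolicy` instead of the default abort behaviour. Note that the queue type and capacity are not recoverable from an existing `ThreadPoolExecutor`, so that part would still have to come from the JGroups stack config.

```java
import java.util.concurrent.*;

public class DaemonExecutorFactory {
    /**
     * Builds a new executor that mirrors the old one's core/max sizes and
     * keep-alive, but uses daemon threads (so the JVM can exit without an
     * explicit shutdown) and runs overflow tasks on the submitting thread.
     */
    public static ThreadPoolExecutor copyWithDaemonThreads(ThreadPoolExecutor old) {
        ThreadFactory daemonFactory = r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // daemon: won't hold the JVM open at shutdown
            return t;
        };
        return new ThreadPoolExecutor(
                old.getCorePoolSize(),
                old.getMaximumPoolSize(),
                old.getKeepAliveTime(TimeUnit.MILLISECONDS),
                TimeUnit.MILLISECONDS,
                // queue config is NOT copyable from the old pool; assumed unbounded here
                new LinkedBlockingQueue<>(),
                daemonFactory,
                // caller-runs instead of abort: backpressure, not RejectedExecutionException
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```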