Reconnect failure under high load #191

Closed
lucwillems opened this issue Nov 10, 2018 · 20 comments
@lucwillems

i was testing/reviewing the performance and stability of the current java library. during this test i noticed that
in some cases a consumer that was disconnected by the server (because it was a slow consumer) was not able to reconnect, because of the following error:

13:05:41.997 [pool-4-thread-1] ERROR Listener - conn: receiver-1
java.io.IOException: Read channel closed.
	at io.nats.client.impl.NatsConnectionReader.run(NatsConnectionReader.java:138)
	at java.lang.Thread.run(Thread.java:748)
13:05:41.997 [pool-4-thread-1] INFO Listener - conn: receiver-1  event=nats: connection disconnected
13:05:43.741 [sender-0  ] INFO Main - sender  : 551439 msg/sec CONNECTED
13:05:44.017 [pool-4-thread-1] ERROR Listener - conn: receiver-1
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:319)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:225)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:471)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$2(NatsConnection.java:428)
	at java.lang.Thread.run(Thread.java:748)
  • the receiver connection sees the disconnect and starts the reconnect
  • during the reconnect we get a TimeoutException, and after this ALL reconnect attempts fail with the same error.

i'm using:

  • docker nats server version 1.3.0
  • java 8 / nats library 2.3.0

i have a small test application which produces this error after some time.
see https://github.com/lucwillems/nats-stability

a workaround is to call connection.close() in the error Listener class, but this of course has
the major drawback that we then need to create a new Connection instance and manage all subscriptions outside the Connection class.
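A minimal sketch of that workaround, assuming the 2.x ErrorListener callbacks; rebuild() is a hypothetical application helper that would create a fresh Connection and re-subscribe:

```java
import io.nats.client.Connection;
import io.nats.client.Consumer;
import io.nats.client.ErrorListener;

// Sketch of the workaround described above: close the connection from the error
// callback and let application code rebuild it. rebuild() is hypothetical.
public class ClosingErrorListener implements ErrorListener {
    @Override
    public void errorOccurred(Connection conn, String error) {
        // protocol-level error reported by the server
    }

    @Override
    public void exceptionOccurred(Connection conn, Exception exp) {
        try {
            conn.close(); // give up on the built-in reconnect loop
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // rebuild(); // application-specific: new Connection + re-subscribe
    }

    @Override
    public void slowConsumerDetected(Connection conn, Consumer consumer) {
        // the server considers this consumer too slow and may drop it
    }
}
```

The listener would be registered via Options.Builder.errorListener(...) when the connection is created.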

@lucwillems
Author

i was able to capture the connect sequence to the nats server during a reconnect failure.
the following sequence was seen on the TCP connection before it was closed:

SERVER: INFO {"server_id":"M1QTiqieHXvevcuRmhrROK","version":"1.3.0","proto":1,"git_commit":"eed4fbc","go":"go1.11","host":"0.0.0.0","port":4222,"max_payload":1048576,"client_id":3} 
CLIENT: CONNECT {"lang":"java","version":"2.1.1","name":"receiver-1","protocol":1,"verbose":false,"pedantic":false,"tls_required":false,"echo":true}
CLIENT: PING
SERVER: PONG
SERVER: PING
SERVER: PING
SERVER: PING

the initial PING was answered by the server with a PONG; about 1 second later the server sent 3 PINGs, one per second, before the connection was closed.

@lucwillems
Author

oops, this was closed by accident.

@kozlovic
Member

Seems that you are using the Java client; you probably should have created this issue in that repo, no?

@wallyqs
Member

wallyqs commented Nov 10, 2018

Btw, transferring issues among repos on GitHub is in beta now: https://help.github.com/articles/transferring-an-issue-to-another-repository/

@derekcollison derekcollison transferred this issue from nats-io/nats.go Nov 10, 2018
@sasbury
Contributor

sasbury commented Nov 11, 2018

This may be related to another Java bug, something with pings. I will take a look next week.

@sasbury
Contributor

sasbury commented Nov 11, 2018

I should say "reported bug."

@lucwillems
Author

i've been looking into a second issue, not reported yet because i'm still investigating it, but the fix for that problem seems to have a positive side effect that also seems to solve/help with this problem.

the second issue i found is that when you have a high speed sender (> 3M msg/sec) and you restart the nats server, simulating a network disconnect, you sometimes get an Out of Memory exception.
this was caused by MessageQueue.queue being an unbounded queue, holding a huge amount of NatsMessages from the sender thread.

replacing this queue with a LinkedBlockingQueue implementation and bounding the size of the queue to +/- 2048 did fix this problem, and as a side effect i could send at a much higher speed than before
(from 2.5 M msg/sec to +/- 5 M msg/sec). the flow also seems more "stable" when watching the nats-top application.

recovery was also working again as expected.

using the same LinkedBlockingQueue on the slow consumer reader thread also seems to help prevent errors. messages will be dropped, but that is to be expected since the sender rate > consumer rate; still, the code seems stable, even with frequent restarts of the nats server.
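As an illustration of the bounded-queue idea above (not the actual patch in the fix-oom-sender branch), a producer blocking on a LinkedBlockingQueue throttles itself instead of growing the heap:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: a bounded queue between a fast producer and a slower consumer.
// put() blocks the producer once ~2048 entries are pending, which caps memory
// use and hands CPU time back to the reader/reconnect threads.
public class BoundedQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(2048);

        Thread producer = new Thread(() -> {
            byte[] payload = new byte[64];
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    queue.put(payload); // blocks when the queue is full
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "sender");

        Thread consumer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    queue.take();
                    Thread.sleep(1); // simulate the slower consumer
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "receiver");

        producer.start();
        consumer.start();
        Thread.sleep(5_000);
        producer.interrupt();
        consumer.interrupt();
    }
}
```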

the patched nats library can be found at https://github.com/lucwillems/java-nats in the fix-oom-sender branch.
the "oom-sender" branch on the nats-stability repository has a simple example of this problem.

i'm going to do some more testing in the coming weeks, because the current results are from a single all-on-one-server test setup.

@sasbury
Contributor

sasbury commented Nov 13, 2018

The queue size for the sender on disconnect is controlled by the options, which should default to 64*1024. It sounds like this is not being respected, causing the overflow. I am surprised the speed is higher; I think I tried that queue and it was slower, but different machines can definitely behave differently.

Looking at the writer, I can see that when I fixed a different bug I messed up the logic for checking the buffer limit on reconnect in canQueue().

I am going to try to find time this week to work on this. I will try the other queue and see what happens, but only after I fix the canQueue bug.

There is an issue though. I am guessing that your fix blocks if you hit the limit. Mine would start to throw after you try to publish more than is allowed while disconnected. I wonder if I should offer a way to block instead?
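For context, a caller-side sketch of the throw-on-limit behavior, assuming the Options.Builder.reconnectBufferSize() setting; the exact exception type thrown when the buffer is full while disconnected is an assumption here:

```java
import io.nats.client.Connection;
import io.nats.client.Nats;
import io.nats.client.Options;

// Sketch only: bound the bytes buffered while reconnecting and handle the case
// where publishing during a disconnect exceeds that limit. The exception type
// is an assumption, not confirmed against the 2.x source.
public class ReconnectBufferSketch {
    public static void main(String[] args) throws Exception {
        Options options = new Options.Builder()
                .server("nats://localhost:4222")
                .reconnectBufferSize(8 * 1024 * 1024) // bytes kept while reconnecting
                .build();

        Connection nc = Nats.connect(options);
        byte[] payload = new byte[64];
        try {
            for (int i = 0; i < 1_000_000; i++) {
                try {
                    nc.publish("bench", payload);
                } catch (IllegalStateException bufferFull) {
                    // assumed failure mode: limit reached while disconnected;
                    // back off (or drop) instead of growing memory without bound
                    Thread.sleep(10);
                }
            }
        } finally {
            nc.close();
        }
    }
}
```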

@lucwillems
Author

the sender limit is indeed controlled by options, but it didn't seem to kick in until the sender thread was blocked by the blocking queue. there seems to be an unfairness between the fast sender thread, which is basically while(true){ publish(xxx); }, and the other IO threads which do the reading and the reconnect. both the read and reconnect threads are relatively slow compared to the work the writer/sender thread must do.

the blocking causes the sender thread to stop for a short time, giving other threads the opportunity to kick in.
under heavy load there are no guarantees about which thread gets scheduled, how often, or whether it gets a fair share of execution time. the blocking queue helps with this when the producer side of the queue is faster than the consumer side.

@sasbury
Contributor

sasbury commented Nov 18, 2018

ok, finally got some time to look at this. my comment about canQueue was wrong; I had a test for that, and looking at the code again the limit in options should be respected. But that doesn't solve the problem of the publisher over-eating the CPU. I will dig into that.

@sasbury
Contributor

sasbury commented Nov 18, 2018

For the publisher problem I am hesitant to change anything. If I change the blocking behavior in the options it is opaque to a caller, and if I add another method it could create confusion. What I did do is update the docs to be very clear about the limits and the exception that is thrown when you hit them.

On to the original issue with the timeout during reconnect killing it.

@sasbury
Contributor

sasbury commented Nov 19, 2018

ok, this was icky - it looks like the issue is that the reading code wasn't resetting properly. Which is ok most of the time, but not all the time. Under heavy load there were times when the reader got stuck thinking it had read the \r but not the \n at the end of a protocol line. I fixed that, and added a test that mocks the server to force a partial read. That test revealed that the opPos variable was also not reset on disconnect/reconnect. I verified that both fixes together make the test pass, and that with only one fix, or neither, the test breaks (it doesn't need a publisher since I fake the server protocol).
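To illustrate the kind of partial read being described (not the project's actual test), a mock server can split a protocol line between the \r and the \n and pause, forcing the client-side reader to carry its parser state across reads:

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch only: a fake server that writes one protocol line in two chunks,
// splitting between '\r' and '\n', to exercise a reader's partial-read handling.
public class PartialReadMockServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(4333)) {
            try (Socket client = server.accept();
                 OutputStream out = client.getOutputStream()) {
                out.write("PONG\r".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                Thread.sleep(100); // the reader now sees a line ending in '\r' only
                out.write("\n".getBytes(StandardCharsets.US_ASCII));
                out.flush();
            }
        }
    }
}
```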

I checked the fixes with your code and got to 5 reconnects and 33M messages. I did not get that far before. Moreover, after each reconnect the client was able to read some messages before getting disconnected and going back for more.

I am going to check the fix into the v2.4.0 branch. Thanks for doing so much work; I am not sure I would have found the issue without your test code.

@sasbury
Contributor

sasbury commented Nov 20, 2018

@lucwillems if you get a chance to try the v2.4.0 branch, let me know if it fixes the problem. I am hoping to fix 1-2 more things and get this into master this week (despite the holiday).

@lucwillems
Author

lucwillems commented Nov 21, 2018 via email

@sasbury
Contributor

sasbury commented Nov 21, 2018

Cool - I will wait until next week to release then, assuming this fix works for you. Thanks!

@lucwillems
Author

results of the re-testing.

  • the slow consumer disconnect/reconnect seems ok. i'm using the main branch for that with 1 sender thread and 1 "slower" receiver thread.
  • a server restart was still failing; error symptoms:
    • the receiver doesn't receive messages
    • Out of memory exception.

i focused today on the OOM error, using the sender-oom branch of my natsstability git.
it only has 1 sender thread, and again, once it is running we restart the single nats server.
this caused an OOM error during reconnect. i used the following JVM arguments:
-Xms512M -Xmx512M -XX:+HeapDumpOnOutOfMemoryError

java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid10110.hprof ...
Heap dump file created [631942335 bytes in 2.855 secs]
Exception in thread "sender Reconnect" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:328)
	at java.util.concurrent.ConcurrentLinkedQueue.add(ConcurrentLinkedQueue.java:297)
	at io.nats.client.impl.MessageQueue.filter(MessageQueue.java:286)
	at io.nats.client.impl.NatsConnectionWriter.stop(NatsConnectionWriter.java:80)
	at io.nats.client.impl.NatsConnection.closeSocketImpl(NatsConnection.java:568)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:459)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$2(NatsConnection.java:431)
	at io.nats.client.impl.NatsConnection$$Lambda$7/1618928271.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)
06:10:55.105 [main] ERROR Main - oeps
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:328)
	at java.util.concurrent.ConcurrentLinkedQueue.add(ConcurrentLinkedQueue.java:297)
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:98)
	at io.nats.client.impl.NatsConnectionWriter.queue(NatsConnectionWriter.java:174)
	at io.nats.client.impl.NatsConnection.queueOutgoing(NatsConnection.java:1112)
	at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:641)
	at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:608)
	at Main.main(Main.java:34)

i checked the heap dump with jhat. at the moment of the dump, there were:

4017519 instances of class io.nats.client.impl.NatsMessage 
7 instances of class ch.qos.logback.classic.Level 

each NatsMessage was 80 bytes big.

The OOM always seems to happen in the filter during NatsConnectionWriter.stop(), executed by the reconnect thread
when it filters for the ping/pong messages.

i also could not find anything that would block the sender thread during this phase.

the reconnect thread pauses the outgoing queue via NatsConnectionWriter.stop(), but NatsConnection.publish() doesn't take this state into account and keeps on publishing
via queueOutgoing(), causing the queue to grow.

this is fine for short bursts, but not when we have a sender that cannot be blocked or stopped ...
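As a caller-side mitigation sketch for the "sender that cannot be blocked" case (not a fix inside the library), the publisher could consult Connection.getStatus() and shed load while the connection is not CONNECTED; this assumes the 2.x Connection.Status enum:

```java
import io.nats.client.Connection;

// Sketch only: drop (and count) messages while the connection is not CONNECTED,
// so an unstoppable publisher does not keep growing the outgoing queue during a reconnect.
public class DroppingPublisher {
    private long dropped;

    public void publish(Connection nc, String subject, byte[] body) {
        if (nc.getStatus() != Connection.Status.CONNECTED) {
            dropped++; // shed load instead of queueing while the connection is down
            return;
        }
        nc.publish(subject, body);
    }

    public long droppedCount() {
        return dropped;
    }
}
```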

@sasbury
Contributor

sasbury commented Nov 26, 2018

Did you set the reconnect buffer limit in the options? It is weird, because I have a test that should prevent the OOM by throwing an exception if you hit the buffer limit, which I thought is in the tens of MB.

@lucwillems
Author

not specifically, so i suspect the default settings would kick in,
but i could not find any limit check when following the publish() path until the message is queued.

@sasbury
Contributor

sasbury commented Nov 27, 2018

I had a thought: is the publisher connection thrashing? The check that limits the size of the buffer only kicks in if the connection is reconnecting or disconnected. If the connection is thrashing, maybe msgs are getting on the queue, then a disconnect happens, so the queue keeps growing? But I didn't think that was happening with the example. Maybe there is some sort of issue with the status check.

If you are digging, the check is in publish(), where I call "canQueue" on the writer.
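Paraphrasing that check as a sketch (not the library's real code; the parameter names are assumptions), the limit only applies while the connection is disconnected or reconnecting, which is exactly the gap discussed in the next comment:

```java
// Sketch of the described logic, not the nats.java source: while disconnected or
// reconnecting, publish() asks the writer whether the pending bytes still fit in
// the reconnect buffer; while the status says CONNECTED no limit is applied.
public class CanQueueSketch {
    static boolean canQueue(long pendingBytes, long messageSize,
                            long reconnectBufferSize, boolean disconnectedOrReconnecting) {
        if (!disconnectedOrReconnecting) {
            return true; // no limit while "connected"
        }
        return pendingBytes + messageSize <= reconnectBufferSize;
    }

    public static void main(String[] args) {
        System.out.println(canQueue(0, 80, 8L * 1024 * 1024, true));                 // true: fits
        System.out.println(canQueue(8L * 1024 * 1024, 80, 8L * 1024 * 1024, true));  // false: over the limit
    }
}
```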

@lucwillems
Author

ok, i have looked deeper into the canQueue story and the OOM exception.
the reason that the writer queue is filling up is that NatsConnection.status == connected, so canQueue is not evaluated anymore, AND the writer thread has finished (was not running anymore) because of an IOException caused by the restart/disconnect of the NATS server:

12:27:34.200 [pool-1-thread-1] INFO Listener - conn: sender event=nats: connection opened
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at io.nats.client.impl.SocketDataPort.write(SocketDataPort.java:98)
at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:151)
at java.lang.Thread.run(Thread.java:748)

i see the reconnect, but after that the writer thread is already dead/finished and never restarted.
so canQueue will not block anymore and the message queue keeps filling up.

@sasbury sasbury mentioned this issue Dec 15, 2018