Reconnect failure under high load #191
I was able to capture the connect sequence to the NATS server during a reconnect failure.
The initial PING was answered by the server with a PONG, and one second later the server sent three PINGs, one per second, before the connection was closed. |
Oops, that was closed unintentionally. |
Seems that you are using the Java client; you probably should have created this issue in that repo, no? |
Btw, transferring issues between repos on GitHub is in beta now: https://help.github.com/articles/transferring-an-issue-to-another-repository/ |
This may be related to another Java bug, something with pings. I will take a look next week. |
I should say "reported bug." |
I've been looking into a second issue, not reported yet because I'm still investigating it, but the solution for that problem seems to have a positive side effect that also appears to solve/help this one. The second issue, if confirmed, is that when you have a high-speed sender (> 3M msg/sec) and you restart the NATS server to simulate a network disconnect, you sometimes get an OutOfMemory exception. Replacing this queue with a LinkedBlockingQueue implementation bounded to roughly 2048 entries fixed that problem and, as a side effect, let me send at a much higher rate than before; recovery was also working again as expected. Using this same LinkedBlockingQueue on the slow-consumer reader thread also seems to help prevent errors. Messages will be dropped, but that is to be expected when the sender rate exceeds the consumer rate; the code seems stable even with frequent restarts of the NATS server. The patched NATS library can be found at https://github.com/lucwillems/java-nats in the fix-oom-sender branch. I'm going to do more testing in the coming weeks, because the current results are from a single all-on-one-server test. |
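A minimal sketch of the idea behind the fix-oom-sender experiment described above, not the library's actual internals: the class and field names here are made up for illustration. Bounding the queue makes a too-fast publisher block instead of growing the heap without limit.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of a bounded outgoing buffer; the real queue
// lives inside the client's writer and this is not its actual API.
class OutgoingBuffer {
    // Bounded to roughly 2048 entries, as in the fix-oom-sender branch.
    private final LinkedBlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(2048);

    void enqueue(byte[] protocolBytes) throws InterruptedException {
        // put() blocks the publishing thread when the queue is full,
        // applying back-pressure instead of letting the heap grow.
        queue.put(protocolBytes);
    }

    byte[] next() throws InterruptedException {
        // The writer thread drains the queue and writes to the socket.
        return queue.take();
    }
}
```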
The queue size for the sender on disconnect is controlled by the options, which should default to 64*1024. It sounds like this is not being respected, causing the overflow. I am surprised the speed is higher; I think I tried that queue and it was slower, but different machines can definitely produce different results. Looking at the writer, I can see that when I fixed a different bug, I messed up the logic for checking the buffer limit on reconnect in canQueue(). I am going to try to find time this week to work on this; I will try the other queue and see what happens, but only after I fix the canQueue bug. There is an issue though. I am guessing that your fix blocks if you hit the limit. Mine would start to throw once you try to publish more than is allowed while disconnected. I wonder if I should offer a way to block instead? |
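For context, the reconnect buffer limit being referred to can be set explicitly through the options builder. This is only a sketch of how that looks in the jnats 2.x API as I remember it; the server URL and the chosen size are placeholders, so check the Options javadoc for the actual default.

```java
import io.nats.client.Connection;
import io.nats.client.Nats;
import io.nats.client.Options;

public class ReconnectBufferExample {
    public static void main(String[] args) throws Exception {
        Options options = new Options.Builder()
                .server("nats://localhost:4222")      // placeholder server URL
                .reconnectBufferSize(64 * 1024)       // bytes buffered while disconnected (as I understand it)
                .build();

        Connection nc = Nats.connect(options);
        nc.publish("test.subject", "hello".getBytes());
        nc.close();
    }
}
```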
The sender limit is indeed controlled by options, but it didn't seem to kick in until the sender thread was blocked by the BlockingQueue. There seems to be an unfairness between the fast sender thread, which is basically while(true){ publish(xxx); }, and the other IO threads which do the reading and reconnecting. Both the read and reconnect threads are relatively slow compared to the work the writer/sender thread must do. The blocking causes the sender thread to stop for a short time, giving the other threads an opportunity to kick in. |
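For reference, the kind of sender loop being described, as a self-contained sketch; the server URL, subject, and payload size are made up.

```java
import io.nats.client.Connection;
import io.nats.client.Nats;

public class FastSender {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222");
        byte[] payload = new byte[128];
        // The tight loop described above: it never blocks or yields, so the
        // comparatively slow reader and reconnect threads are starved of CPU.
        // A bounded outgoing queue makes the publish path pause when full,
        // which is exactly the gap that lets those IO threads run.
        while (true) {
            nc.publish("bench.subject", payload);
        }
    }
}
```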
OK, finally got some time to look at this. My comment about canQueue was wrong; I had a test for that, and looking at the code again the limit in options should be respected. But that doesn't solve the problem of the publisher eating the CPU. I will dig into that. |
For the publisher problem I am hesitant to change anything. If I change the blocking behavior in options it is opaque to a caller, but if I make another method it could create confusion. What I did do is update the doc to be very clear about the limits and the exception that is thrown when you hit them. On to the original issue with the timeout during reconnect killing it. |
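A sketch of what that looks like from the caller's side, assuming the over-limit case surfaces as an IllegalStateException; the exact exception type and message should be taken from the updated javadoc mentioned above, not from this example.

```java
import io.nats.client.Connection;
import io.nats.client.Nats;

public class PublishWhileDisconnected {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222");
        byte[] payload = "hello".getBytes();
        try {
            // While disconnected, publishes are buffered up to the configured
            // reconnect buffer size; beyond that the client throws rather than
            // buffering without bound.
            nc.publish("test.subject", payload);
        } catch (IllegalStateException e) {
            // Assumption: the over-limit case is reported as an IllegalStateException;
            // see the client's javadoc for the exact exception and message.
            System.err.println("reconnect buffer full, message not queued: " + e.getMessage());
        }
        nc.close();
    }
}
```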
OK, this was icky - it looks like the issue is that the reading code wasn't resetting properly. Which is OK most of the time, but not all the time. Under heavy load there were times when the reader got stuck thinking it had read the \r but not the \n at the end of a protocol line. I fixed that and added a test that mocks the server to force a partial read, which revealed that the opPos variable was also not reset on disconnect/reconnect. I verified that both fixes make the test pass, and that with only one fix, or neither, the test breaks (it doesn't need a publisher since I fake the server protocol). I checked the fixes with your code and got to 5 reconnects and 33M messages; I did not reach that before. Moreover, each reconnect was able to read some messages before getting disconnected and going back for more. I am going to check the fix into the v2.4.0 branch. Thanks for doing so much work; I am not sure I would have found the issue without your test code. |
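A simplified sketch of the kind of parser state involved; the class and field names here are illustrative, not the client's actual ones. If the carriage-return flag and the op position survive a disconnect, the first bytes read after reconnect are interpreted as the tail of the previous protocol line.

```java
// Illustrative only: a tiny protocol-line reader with the two pieces of
// state discussed above. These names are made up for the sketch.
class LineReader {
    private final StringBuilder op = new StringBuilder();
    private boolean gotCR = false;   // saw '\r', still waiting for '\n'
    private int opPos = 0;           // position inside the current protocol token

    void onByte(byte b) {
        if (gotCR) {
            if (b == '\n') {
                dispatch(op.toString());
                op.setLength(0);
                opPos = 0;
            }
            gotCR = false;
            return;
        }
        if (b == '\r') {
            gotCR = true;
        } else {
            op.append((char) b);
            opPos++;
        }
    }

    // The fix: clear *all* parser state when the socket drops, otherwise a
    // partial read from the old connection poisons the new one.
    void resetOnDisconnect() {
        gotCR = false;
        opPos = 0;
        op.setLength(0);
    }

    private void dispatch(String line) {
        System.out.println("protocol line: " + line);
    }
}
```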
lucwillems if you get a chance to try the v2.4.0 branch let me know if it fixes the problem. I am hoping to fix 1-2 more things and get this in master this week (despite the holiday) |
I will give it a go on Friday. Luc |
Cool - I will wait until next week to release then, assuming this fix works for you. Thanks! |
Results of the re-testing:
I focused today on the OOM error, using the sender-oom branch of my nats-stability repo,
and checked the heap dump with jhat. At the moment of the dump, there were ...
each NatsMessage was 80 bytes. The OOM always seems to occur in the filtering during NatsConnectionWriter.stop(). I also could not find anything that would block the sender thread during this phase. The reconnect thread pauses the outgoing queue via NatsConnectionWriter.stop(), but NatsConnection.publish() doesn't take this state into account and keeps on publishing. This is fine for short bursts, but not in the case where we have a sender that cannot be blocked or stopped ... |
Did you set the reconnect buffer limit in the options? It is weird because I have a test that should prevent the OOM by throwing an exception if you hit the buffer limit, which is in the tens of MB I thought. |
Not specifically, so I suspect the default settings would kick in. |
I had a thought: is the publisher connection thrashing? The check that limits the size of the buffer only kicks in if the connection is reconnecting or disconnected. If the connection is thrashing, maybe messages are getting on the queue, then a disconnect happens, so the queue keeps growing? But I didn't think that was happening with the example. Maybe there is some sort of issue with the status check. If you are digging, the check is in publish() where I call "canQueue" on the writer. |
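Roughly how the check being described reads, as a paraphrase rather than the literal source: the byte limit is only enforced while the connection is reconnecting or disconnected, so a thrashing connection could interleave unchecked growth with checked periods.

```java
// Paraphrase of the publish-time check described above (not the actual
// client source): the buffer limit only applies while disconnected or
// reconnecting.
public class CanQueueSketch {
    static boolean canQueue(long msgSizeBytes, long bufferedBytes,
                            long maxReconnectBufferBytes, boolean disconnectedOrReconnecting) {
        if (!disconnectedOrReconnecting) {
            return true; // connected: the writer thread is draining the queue anyway
        }
        // disconnected: only accept the message if it fits under the limit
        return bufferedBytes + msgSizeBytes <= maxReconnectBufferBytes;
    }

    public static void main(String[] args) {
        // If the connection thrashes, a burst queued while "connected" is never
        // counted against the limit, which is one way the buffer could overshoot.
        System.out.println(canQueue(80, 10_000, 64 * 1024, true));    // true, fits
        System.out.println(canQueue(80, 64 * 1024, 64 * 1024, true)); // false, over limit
    }
}
```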
OK, I have looked deeper into the canQueue story and the OOM exception. The log shows:
12:27:34.200 [pool-1-thread-1] INFO Listener - conn: sender event=nats: connection opened
so I see the reconnect, but after that the writer thread is already dead/finished and not restarted anymore. |
I was testing/reviewing the performance and stability of the current Java library. During this test I noticed that
in some cases a disconnected consumer (disconnected by the server because of a slow consumer) was not able to reconnect because of the following error:
I'm using:
I have a small test application which produces this error after some time;
see https://github.com/lucwillems/nats-stability
A workaround is to call connection.close() in the ErrorListener class, but this of course has
the major drawback that we need to create a new Connection instance and manage all subscriptions outside the Connection class.
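A sketch of that workaround, assuming the jnats 2.x ErrorListener interface: the server URL is a placeholder, and the application is still responsible for building a fresh Connection and re-creating every subscription after the close.

```java
import io.nats.client.Connection;
import io.nats.client.Consumer;
import io.nats.client.ErrorListener;
import io.nats.client.Nats;
import io.nats.client.Options;

// Sketch of the close-on-error workaround described above.
public class CloseOnErrorWorkaround {
    public static void main(String[] args) throws Exception {
        Options options = new Options.Builder()
                .server("nats://localhost:4222")
                .errorListener(new ErrorListener() {
                    @Override
                    public void errorOccurred(Connection conn, String error) {
                        try {
                            conn.close(); // workaround: drop the broken connection
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }

                    @Override
                    public void exceptionOccurred(Connection conn, Exception exp) { }

                    @Override
                    public void slowConsumerDetected(Connection conn, Consumer consumer) { }
                })
                .build();

        Connection nc = Nats.connect(options);
        // The application must then detect the close and create a new
        // Connection itself, re-subscribing everything outside the old one.
    }
}
```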