Skip to content

Conversation

yuanmwang-wf
Copy link
Contributor

@yuanmwang-wf yuanmwang-wf commented Apr 3, 2019

We've noticed a small percentage of clients using jnats 2.4.3 (roughly 1-2%) were not able to reconnect when nats server restarted - they either seemed to hang or were stuck in a bad loop where a Socket closed exception kept being thrown every two seconds followed by a disconnected message.

After looking into the issue we found the clients that seemed to hang were stuck at https://github.com/yuanmwang-wf/java-nats/blob/95a229a38eab802491593808f98501053deb05aa/src/main/java/io/nats/client/impl/NatsConnection.java#L1061, which blocks the reconnect thread. I've added an executor service which enforces the connection timeout for readInitialInfo, checkVersionRequirements and upgradeToSecureIfNeeded, if these operations could not finish before connection timeout, an exception is thrown which will then close the socket and the reconnect thread will be allowed to try again.

Another issue we found seemed to be related to how IOException is saved to exceptionDuringConnectChange in NatsConnection. If an IOException is thrown during closeSocket while reconnecting here, that exception will be saved to exceptionDuringConnectChange through handleCommunicationIssue in NatsConnectionReader. That exception is not cleaned up and the next reconnect attempt will fail even if the socket is created correctly and tls upgrade succeeded because exceptionDuringConnectChange is not null: https://github.com/yuanmwang-wf/java-nats/blob/95a229a38eab802491593808f98501053deb05aa/src/main/java/io/nats/client/impl/NatsConnection.java#L362, this could cause the reconnect thread to be stuck in a loop if the same exception is thrown again in closeSocketImpl(). I've added this line to reset exceptionDuringConnectChange to null.

I've been testing the reconnection story in jnats pretty extensively recently and with these changes all connections were able to reconnect in my test cases. Please let me know if you think there's a better approach. @sasbury

@sasbury
Copy link
Contributor

sasbury commented Apr 3, 2019

don't worry about travis, it isn't working for PRs right now. I will take a look at this next week, I am on a business trip so my brain isn't working well.

this.callbackRunner = Executors.newSingleThreadExecutor();

this.executor = options.getExecutor();
this.connectExecutor = Executors.newSingleThreadExecutor();
Copy link

@kevinfrommelt-wf kevinfrommelt-wf Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sasbury
Copy link
Contributor

sasbury commented Apr 5, 2019

i am going to change the PR to the 2.4.3 branch and pull this in so i can play with it. The error reset is a great catch, and the forced timeout seems like a good idea.

@sasbury sasbury changed the base branch from master to 2.4.3 April 5, 2019 17:44
@sasbury sasbury merged commit ff5b3e5 into nats-io:2.4.3 Apr 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants