
Socket Errors after Idle #145

Closed
roadbikemike opened this issue Feb 14, 2017 · 13 comments

Comments

@roadbikemike commented Feb 14, 2017

When we are running load tests against Neo using the .NET 1.1.1 driver, there are no issues; we can push thousands of requests through as expected. However, if we let our application sit idle and then try to connect, we get the errors below. In this case, I let Neo and the client application sit overnight with no activity. Yesterday, before leaving, everything was working great, but it was throwing these errors this morning. After a restart of the client-side application, things were again working as intended. It seems as if the driver is leaving something open and expecting the socket to be alive when it is not.

Client Side:
2017-02-14 13:17:08,584 [18] ERROR NeoDrivergFileAppender - Unable to unpack message from server, connection has been terminated.
System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
at Neo4j.Driver.Internal.Connector.SocketExtensions.Read(Stream stream, Byte[] bytes)
at Neo4j.Driver.Internal.Connector.ChunkedInputStream.ReadSpecifiedSize(Byte[] buffer)
at Neo4j.Driver.Internal.Connector.ChunkedInputStream.Ensure(Int32 size)
at Neo4j.Driver.Internal.Connector.ChunkedInputStream.ReadByte()
at Neo4j.Driver.Internal.Packstream.PackStream.Unpacker.UnpackStructHeader()
at Neo4j.Driver.Internal.Packstream.PackStreamMessageFormatV1.ReaderV1.Read(IMessageResponseHandler responseHandler)
at Neo4j.Driver.Internal.Connector.SocketClient.ReceiveOne(IMessageResponseHandler responseHandler)

Server Side:
2017-02-14 13:16:59.567+0000 ERROR [o.n.b.t.SocketTransportHandler] Fatal error occurred when handling a client connection: Connection reset by peer Connection reset by peer
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1100)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:366)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:574)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:488)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
at java.lang.Thread.run(Thread.java:745)

It takes the Neo driver exactly 18 seconds before the client-side error is generated, which is odd; it is exactly 18 seconds every time. Again, it seems as if the driver never closes a connection and expects it to be open when it clearly is not. This is simply a theory, though.

@zhenlineo (Contributor)

Hi, the client maintains a connection pool and reuses connections from the pool. If you do not want any connections to be pooled, you can turn pooling off (the default idle pool size is 10); then every connection to the server will be a new connection. This might slow your app down, though, as socket connections are expensive to establish and close. We currently have an experimental feature in the java-driver that tests connectivity after a certain idle timeout before a connection is borrowed from the pool. We might introduce it to the .NET driver too to avoid this problem.

So the behaviour you are seeing is expected, and you might need to introduce some retries for now to work around it.
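
For reference, a minimal sketch of controlling the idle pool through the driver Config, assuming the 1.x API in Neo4j.Driver.V1 (the builder method name WithMaxIdleSessionPoolSize is taken from the 1.x docs and may differ in your driver version; the URI and credentials are placeholders):

```csharp
using Neo4j.Driver.V1;

// Keep no idle connections (0) so every session gets a fresh socket,
// or pick a small number to balance reuse against staleness.
var config = Config.Builder
    .WithMaxIdleSessionPoolSize(0)
    .ToConfig();

using (var driver = GraphDatabase.Driver("bolt://localhost:7687",
                                         AuthTokens.Basic("neo4j", "password"),
                                         config))
using (var session = driver.Session())
{
    session.Run("RETURN 1").Consume();
}
```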

@roadbikemike (Author) commented Feb 14, 2017

Hmmm, that does not sound ideal. The only way I could get a connection to Neo once the client was in this state was to restart the web app. It seems as if these connections in the pool should definitely expire after some amount of time; having to restart the application in order to recover is not a suitable solution. Do you have any more details on the connectivity test?

I re-tested a moment ago and a retry did reconnect without a restart. My assumption is that the pool was full from yesterday's testing and no new connections could be added (via retry). Any idea what causes the 18 seconds? That is a really long time to wait before retrying...

@zhenlineo (Contributor)

No, you should try the following (a rough sketch of both is below):

  1. Add retries, so that when a query fails due to this connection error you can replay it. If you retry more times than the connection pool size, you should be able to recover from this.
  2. Do a background ping such as RETURN 1 in your app to keep the connections alive.
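
A rough sketch of both suggestions, assuming the 1.x API in Neo4j.Driver.V1 (the retry count, exception types caught, and ping interval are illustrative choices, not driver defaults):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using Neo4j.Driver.V1;

public static class NeoHelpers
{
    // 1. Retry: retrying more times than the idle pool size drains any stale
    //    pooled connections and eventually forces a fresh socket.
    public static List<IRecord> RunWithRetry(IDriver driver, string cypher, int maxAttempts = 11)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                using (var session = driver.Session())
                {
                    // Materialise the result before the session is disposed.
                    return session.Run(cypher).ToList();
                }
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                                       (ex is IOException || ex is Neo4jException))
            {
                // Stale/broken pooled connection; loop and try another one.
            }
        }
    }

    // 2. Keep-alive: run a trivial query on a timer so pooled connections are
    //    never idle long enough for the OS or a firewall to kill them.
    public static Timer StartKeepAlive(IDriver driver, TimeSpan interval)
    {
        return new Timer(_ => RunWithRetry(driver, "RETURN 1"),
                         null, TimeSpan.Zero, interval);
    }
}
```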

@roadbikemike (Author)

It appears that once the connection pool is full, all of the connections are essentially dead after some time, meaning there is no way to obtain a new, valid connection without destroying the pool and starting over. That is why restarting the application addressed the issue.

When I tested this morning, the first connection to Neo failed (after 18 seconds), but the second was successful, as it likely picked up another connection since the pool was not full. Does this mean I may have a pool containing both stale connections and healthy ones? Is there a chance I may hit another 18-second timeout if the app selects a stale connection? Having a way for connections to expire in the pool when they have not been used for some time frame would be ideal.

Something we are exploring is setting a TTL on the GraphDatabase.Driver object: once some period has been reached, throw it away and create a new one. This should destroy the pool and start fresh.
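
For reference, a minimal sketch of that TTL idea, assuming the 1.x API in Neo4j.Driver.V1 (the one-hour lifetime, URI, and credentials are placeholders):

```csharp
using System;
using Neo4j.Driver.V1;

// Wraps the driver and rebuilds it (dropping the whole connection pool)
// once it has been alive longer than the configured TTL.
public class ExpiringDriver : IDisposable
{
    private readonly TimeSpan _ttl;
    private readonly object _lock = new object();
    private IDriver _driver;
    private DateTime _createdAtUtc;

    public ExpiringDriver(TimeSpan ttl)
    {
        _ttl = ttl;
    }

    public IDriver Get()
    {
        lock (_lock)
        {
            if (_driver == null || DateTime.UtcNow - _createdAtUtc > _ttl)
            {
                _driver?.Dispose(); // throw away the stale pool
                _driver = GraphDatabase.Driver("bolt://localhost:7687",
                                               AuthTokens.Basic("neo4j", "password"));
                _createdAtUtc = DateTime.UtcNow;
            }
            return _driver;
        }
    }

    public void Dispose()
    {
        lock (_lock) { _driver?.Dispose(); }
    }
}

// Usage: var drivers = new ExpiringDriver(TimeSpan.FromHours(1));
//        using (var session = drivers.Get().Session()) { ... }
```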

@zhenlineo (Contributor)

You do not have to destroy the pool to get a valid connection. The driver prefers to grab a connection from the pool first, which is why you get a broken connection (if a connection sits idle too long, it gets killed by your system; the 18s might be a timeout used by your local system). But a broken connection will not be returned to the pool, so if you retry more times than the pool size, you are guaranteed to get a new one.

That said, if you always have to wait the 18s first, then recreating the driver object might be an easier way for you.

@roadbikemike (Author)

I see this exact same issue was identified and fixed in the Java driver. Could you perhaps provide some information as to when we will see this in the .NET driver? At this point in time, this is a big issue for us. As usual, thanks zhenlineo!

@zhenlineo (Contributor)

Hi @krushmike

Yes, there is a "fix" in the java driver. The reason it has not made it into the other drivers is not that the other drivers or their users are less important; it is mainly that it is not a good solution.

As you can imagine, there are many ways to fix this, and the one we used in java is not really ideal. I could give you the same thing as the java driver in a patch release, but I do not think it would solve your problem. As you mentioned, it takes 18s to receive a connection failure; if I simply add what the java-driver has, you might end up waiting 18s * poolSize for the first query you run when you come back after idling. I do not think that is the fix we are aiming for.

I need to understand why it takes 18s to fail, and we are also trying to see if there is anything better we can do if we cannot avoid the 18s on some systems. We are currently planning to fix it in the 1.2 or 1.3 driver timeline (I have already branched 1.2 and started to look into the connection pool), but the final release of 1.2 or 1.3 might be a few months from now. So I would first suggest trying a workaround at the system/application level: do a keep-alive ping, or just recreate the driver after your system's socket timeout has been reached.

@roadbikemike (Author)

We are running our apps in a PaaS environment, so we do not have access to any system-level tuning such as keep-alive, TCP timeouts, etc. Implementing a keep-alive ping in the application itself may be the best solution for the interim. While not exactly related, here is a good article on how MS manages the connection pool in their ADO.NET driver. Perhaps there are some ideas you could implement in the Neo driver:

https://msdn.microsoft.com/en-us/library/8xx3tyca%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396

This is the most notable:

"The connection pooler removes a connection from the pool after it has been idle for approximately 4-8 minutes, or if the pooler detects that the connection with the server has been severed"

One question. If we implement a ping of some sort, does that guarantee this will keep all 50 connections alive in our pool?

@zhenlineo (Contributor) commented Feb 15, 2017

Thanks so much for providing this related pool implementation; it definitely gives us some insight.

Could you also inspect the connections between the client and the server, to see what traffic passes during the "18s" after a long idle time?

According to the stack trace you provided, the line at Neo4j.Driver.Internal.Connector.SocketClient.ReceiveOne(IMessageResponseHandler responseHandler) indicates that the driver had already sent something to the server, but the client failed to read a response within a certain timeout (maybe the 18s).

Cheers

@zhenlineo (Contributor) commented Feb 15, 2017

The connection pool is a queue, so if you ping 50 times you are sure to have pinged every connection. But I would not suggest keeping 50 idle connections in the pool unless you hit performance issues; otherwise, use a smaller number such as 10. Just so you know, the idle pool size does not limit the maximum number of connections your app can use.
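
A small sketch of that, assuming the 1.x API in Neo4j.Driver.V1: because pooled connections are handed out in queue order, opening and closing a session poolSize times, one at a time, should borrow and return each idle connection once (poolSize here is whatever idle pool size you configured; it is not read from the driver):

```csharp
using Neo4j.Driver.V1;

public static class PoolPinger
{
    // Run a trivial query once per pooled connection. Each iteration borrows
    // the connection at the head of the queue, uses it, and returns it to the
    // tail, so after poolSize iterations every idle connection has seen traffic.
    public static void PingPool(IDriver driver, int poolSize)
    {
        for (var i = 0; i < poolSize; i++)
        {
            using (var session = driver.Session())
            {
                session.Run("RETURN 1").Consume();
            }
        }
    }
}
```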

@roadbikemike (Author)

To note, the 18 seconds is derived from a Windows setting which only allows 5 TCP retransmissions before terminating the connection. Each retransmission doubles the wait time from the previous one, which ultimately adds up to the 18 seconds.

[screenshot: neo_timeout]
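
For what it's worth, a rough back-of-the-envelope version of that arithmetic (the initial retransmission timeout depends on the measured RTT, so the ~0.6 s base figure here is only an assumption, not a Windows default): with the timer doubling before each of the 5 retransmissions, the total wait is roughly 0.6 + 1.2 + 2.4 + 4.8 + 9.6 = 0.6 × 31 ≈ 18.6 s, which lines up with the observed 18 seconds.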

@zhenlineo (Contributor)

Hi, I am closing this as the 1.2.2 release already addresses this issue. Please feel free to reopen it if you still have any problems.

Cheers.

