Skip to content

Conversation

@mdumandag
Copy link
Contributor

We have a couple of problematic parts in the core of the reactor.

The first one is that we run periodic tasks that may open up connections,
such as the timer in the connection manager that tries to open
connections to members in the memberlist. In the timer, we do not
wait for authentication, etc. so it is not blocking. But if the address
we try to connect is unreachable(for example, in case of the pod restarts in k8s)
socket.connect call(which we do in the constructor of the AsyncoreConnection)
will block as long as the connection timeout which is 5s by default.
That means that, if you have 3 members in the memberlist, and they all become
unreachable to the client, reactor thread may be blocked for 3 * 5 = 15s
in the timer, while waiting for the socket.connect to return.

To overcome this, we now connect to sockets non-blockingly (with a 0s timeout)
and spawn a timer that will run after connection timeout seconds, that will
close the connection, if it is not connected at that time. If a problem occurs
before the timer, we close the connection and cancel the timer.

Apart from that, we also had a problem with the handle_read/write logic.
After some research, I saw that non-blocking sockets may throw
EAGAIN and EWOULDBLOCK errors, which should be retried. Also, if the SSL is
enabled, it may throw SSL_ERROR_WANT_WRITE/READ. See
https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html.

Formerly, if the handle_write/read throws EAGAIN or EDEADLK, we were not
closing the connection, but also not retrying the write correctly. For example,
we pop a message from the write queue, try to send it, get EAGAIN, go to handle_error,
simply ignore the error and go on. But, the popped message was never appended back
to the write queue. Therefore, the error handling logic is moved inside the
handle_write/read, and proper message retry is implemented.

Regarding the EDEADLK, I couldn't find a proper reason to retry it. It was
added in
72c6537.
I only saw this notable library doing it,
https://github.com/docker/docker-py/blob/master/docker/utils/socket.py#L28
not sure about it, but to not change the client's behavior on some edge cases,
I am adding it to the retryable exceptions.

@mdumandag mdumandag added this to the 4.0 milestone Dec 23, 2020
@mdumandag mdumandag self-assigned this Dec 23, 2020
Copy link

@puzpuzpuz puzpuzpuz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catches!

We have a couple of problematic parts in the core of the reactor.

The first one is that we run periodic tasks that may open up connections,
such as the timer in the connection manager that tries to open
connections to members in the memberlist. In the timer, we do not
wait for authentication, etc. so it is not blocking. But if the address
we try to connect is unreachable(for example, in case of the pod restarts in k8s)
socket.connect call(which we do in the constructor of the AsyncoreConnection)
will block as long as the connection timeout which is 5s by default.
That means that, if you have 3 members in the memberlist, and they all become
unreachable to the client, reactor thread may be blocked for 3 * 5 = 15s
in the timer, while waiting for the socket.connect to return.

To overcome this, we now connect to sockets non-blockingly (with a 0s timeout)
and spawn a timer that will run after connection timeout seconds, that will
close the connection, if it is not connected at that time. If a problem occurs
before the timer, we close the connection and cancel the timer.

Apart from that, we also had a problem with the handle_read/write logic.
After some research, I saw that non-blocking sockets may throw
EAGAIN and EWOULDBLOCK errors, which should be retried. Also, if the SSL is
enabled, it may throw SSL_ERROR_WANT_WRITE/READ. See
https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html.

Formerly, if the handle_write/read throws EAGAIN or EDEADLK, we were not
closing the connection, but also not retrying the write correctly. For example,
we pop a message from the write queue, try to send it, get EAGAIN, go to handle_error,
simply ignore the error and go on. But, the popped message was never appended back
to the write queue. Therefore, the error handling logic is moved inside the
handle_write/read, and proper message retry is implemented.

Regarding the EDEADLK, I couldn't find a proper reason to retry it. It was
added in
hazelcast@72c6537.
I only saw this notable library doing it,
https://github.com/docker/docker-py/blob/master/docker/utils/socket.py#L28
not sure about it, but to not change the client's behavior on some edge cases,
I am adding it to the retryable exceptions.
@mdumandag mdumandag force-pushed the non-blocking-connect branch from c1cf3a0 to c2c4e93 Compare December 24, 2020 07:58
@mdumandag
Copy link
Contributor Author

Thanks a lot for the review Andrey

@mdumandag mdumandag merged commit f526367 into hazelcast:master Dec 24, 2020
@mdumandag mdumandag deleted the non-blocking-connect branch December 24, 2020 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants