Refactor the reactor connection, write and read logic #274

mdumandag · 2020-12-23T12:11:03Z

We have a couple of problematic parts in the core of the reactor.

The first one is that we run periodic tasks that may open up connections,
such as the timer in the connection manager that tries to open
connections to members in the memberlist. In the timer, we do not
wait for authentication, etc. so it is not blocking. But if the address
we try to connect is unreachable(for example, in case of the pod restarts in k8s)
socket.connect call(which we do in the constructor of the AsyncoreConnection)
will block as long as the connection timeout which is 5s by default.
That means that, if you have 3 members in the memberlist, and they all become
unreachable to the client, reactor thread may be blocked for 3 * 5 = 15s
in the timer, while waiting for the socket.connect to return.

To overcome this, we now connect to sockets non-blockingly (with a 0s timeout)
and spawn a timer that will run after connection timeout seconds, that will
close the connection, if it is not connected at that time. If a problem occurs
before the timer, we close the connection and cancel the timer.

Apart from that, we also had a problem with the handle_read/write logic.
After some research, I saw that non-blocking sockets may throw
EAGAIN and EWOULDBLOCK errors, which should be retried. Also, if the SSL is
enabled, it may throw SSL_ERROR_WANT_WRITE/READ. See
https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html.

Formerly, if the handle_write/read throws EAGAIN or EDEADLK, we were not
closing the connection, but also not retrying the write correctly. For example,
we pop a message from the write queue, try to send it, get EAGAIN, go to handle_error,
simply ignore the error and go on. But, the popped message was never appended back
to the write queue. Therefore, the error handling logic is moved inside the
handle_write/read, and proper message retry is implemented.

Regarding the EDEADLK, I couldn't find a proper reason to retry it. It was
added in
72c6537.
I only saw this notable library doing it,
https://github.com/docker/docker-py/blob/master/docker/utils/socket.py#L28
not sure about it, but to not change the client's behavior on some edge cases,
I am adding it to the retryable exceptions.

puzpuzpuz

Nice catches!

hazelcast/discovery.py

hazelcast/reactor.py

We have a couple of problematic parts in the core of the reactor. The first one is that we run periodic tasks that may open up connections, such as the timer in the connection manager that tries to open connections to members in the memberlist. In the timer, we do not wait for authentication, etc. so it is not blocking. But if the address we try to connect is unreachable(for example, in case of the pod restarts in k8s) socket.connect call(which we do in the constructor of the AsyncoreConnection) will block as long as the connection timeout which is 5s by default. That means that, if you have 3 members in the memberlist, and they all become unreachable to the client, reactor thread may be blocked for 3 * 5 = 15s in the timer, while waiting for the socket.connect to return. To overcome this, we now connect to sockets non-blockingly (with a 0s timeout) and spawn a timer that will run after connection timeout seconds, that will close the connection, if it is not connected at that time. If a problem occurs before the timer, we close the connection and cancel the timer. Apart from that, we also had a problem with the handle_read/write logic. After some research, I saw that non-blocking sockets may throw EAGAIN and EWOULDBLOCK errors, which should be retried. Also, if the SSL is enabled, it may throw SSL_ERROR_WANT_WRITE/READ. See https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html. Formerly, if the handle_write/read throws EAGAIN or EDEADLK, we were not closing the connection, but also not retrying the write correctly. For example, we pop a message from the write queue, try to send it, get EAGAIN, go to handle_error, simply ignore the error and go on. But, the popped message was never appended back to the write queue. Therefore, the error handling logic is moved inside the handle_write/read, and proper message retry is implemented. Regarding the EDEADLK, I couldn't find a proper reason to retry it. It was added in hazelcast@72c6537. I only saw this notable library doing it, https://github.com/docker/docker-py/blob/master/docker/utils/socket.py#L28 not sure about it, but to not change the client's behavior on some edge cases, I am adding it to the retryable exceptions.

mdumandag · 2020-12-24T08:10:20Z

Thanks a lot for the review Andrey

mdumandag added Type: Defect Source: Internal labels Dec 23, 2020

mdumandag added this to the 4.0 milestone Dec 23, 2020

mdumandag self-assigned this Dec 23, 2020

puzpuzpuz approved these changes Dec 24, 2020

View reviewed changes

hazelcast/discovery.py Show resolved Hide resolved

hazelcast/reactor.py Outdated Show resolved Hide resolved

hazelcast/reactor.py Outdated Show resolved Hide resolved

mdumandag added 4 commits December 24, 2020 10:56

move ssl tests to ssl_tests directory to fix module resolution problem

236cdde

set socket timeout to 0 explicitly

0ded478

address review comments

c2c4e93

mdumandag force-pushed the non-blocking-connect branch from c1cf3a0 to c2c4e93 Compare December 24, 2020 07:58

mdumandag merged commit f526367 into hazelcast:master Dec 24, 2020

mdumandag deleted the non-blocking-connect branch December 24, 2020 08:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor the reactor connection, write and read logic #274

Refactor the reactor connection, write and read logic #274

Uh oh!

mdumandag commented Dec 23, 2020

Uh oh!

puzpuzpuz left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mdumandag commented Dec 24, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Refactor the reactor connection, write and read logic #274

Refactor the reactor connection, write and read logic #274

Uh oh!

Conversation

mdumandag commented Dec 23, 2020

Uh oh!

puzpuzpuz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mdumandag commented Dec 24, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants