Refactor the reactor connection, write and read logic #274
The core of the reactor has a couple of problematic parts.
The first is that we run periodic tasks that may open connections,
such as the timer in the connection manager that tries to connect
to the members in the member list. In that timer, we do not
wait for authentication, etc., so that part does not block. But if the address
we try to connect to is unreachable (for example, when a pod restarts in k8s), the
socket.connect call (which we make in the constructor of AsyncoreConnection)
blocks for up to the connection timeout, which is 5s by default.
That means that if there are 3 members in the member list and they all become
unreachable to the client, the reactor thread may be blocked for 3 * 5 = 15s
in the timer while waiting for the socket.connect calls to return.
To overcome this, we now connect sockets in non-blocking mode (with a 0s timeout)
and schedule a timer that runs after connection timeout seconds and
closes the connection if it is still not connected by then. If a problem occurs
before the timer fires, we close the connection and cancel the timer.
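
The connect-then-timer idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PendingConnection` is a hypothetical helper, and `threading.Timer` stands in for the reactor's own timer, which I assume runs on the reactor thread in the real client.

```python
import errno
import socket
import threading

# connect_ex() results that mean "still connecting" rather than "failed".
_IN_PROGRESS = {0, errno.EINPROGRESS, errno.EWOULDBLOCK, errno.EALREADY}


class PendingConnection:
    """Hypothetical helper, not the client's AsyncoreConnection."""

    def __init__(self, address, connection_timeout=5.0):
        self.connected = False  # the reactor would set this once connected
        self.closed = False
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.setblocking(False)  # equivalent to a 0s timeout
        # Stand-in for the reactor timer: fires after connection_timeout.
        self._timer = threading.Timer(connection_timeout, self._on_timeout)
        self._timer.daemon = True
        self._timer.start()
        # Returns immediately instead of blocking for the whole timeout.
        if self.sock.connect_ex(address) not in _IN_PROGRESS:
            self.close()  # a problem occurred before the timer fired

    def _on_timeout(self):
        # Close only if the connection was never established.
        if not self.connected:
            self.close()

    def close(self):
        if not self.closed:
            self.closed = True
            self._timer.cancel()  # no-op if the timer already fired
            self.sock.close()
```

With this shape, three unreachable members cost three immediate `connect_ex` calls rather than three blocking waits, and each attempt is cleaned up by its own timer.
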
Apart from that, we also had a problem with the handle_read/handle_write logic.
After some research, I saw that non-blocking sockets may raise
EAGAIN and EWOULDBLOCK errors, after which the operation should be retried. Also, if SSL is
enabled, they may raise SSL_ERROR_WANT_WRITE/READ. See
https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html.
Formerly, if handle_write/handle_read raised EAGAIN or EDEADLK, we did not
close the connection, but we also did not retry the write correctly. For example,
we would pop a message from the write queue, try to send it, get EAGAIN, go to handle_error,
simply ignore the error and move on. But the popped message was never put back
on the write queue. Therefore, the error handling logic is moved inside
handle_write/handle_read, and proper message retry is implemented.
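
As a rough sketch of the retry behavior described above (not the actual code in this PR): `is_retryable` and this free-standing `handle_write` are hypothetical names, the write queue is assumed to be a `collections.deque` of `bytes`, and the Python 3 exception classes `ssl.SSLWantReadError` / `ssl.SSLWantWriteError` are used to represent SSL_ERROR_WANT_READ/WRITE.

```python
import errno
import ssl
from collections import deque

# errno values treated as "try again later" instead of "connection broken".
_RETRYABLE_ERRNOS = {errno.EAGAIN, errno.EWOULDBLOCK, errno.EDEADLK}


def is_retryable(err):
    # Check the SSL cases first: ssl.SSLError is a subclass of OSError.
    if isinstance(err, (ssl.SSLWantReadError, ssl.SSLWantWriteError)):
        return True
    return isinstance(err, OSError) and err.errno in _RETRYABLE_ERRNOS


def handle_write(sock, write_queue):
    """Send queued messages; return False if the connection should be closed."""
    while write_queue:
        data = write_queue.popleft()
        try:
            sent = sock.send(data)
        except OSError as err:
            if is_retryable(err):
                # Put the popped message back so it is not lost.
                write_queue.appendleft(data)
                return True
            return False
        if sent < len(data):
            # Partial write: keep the unsent tail at the front of the queue.
            write_queue.appendleft(data[sent:])
            return True
    return True
```

The key point is the `appendleft` in the retryable branch: the message goes back to the front of the queue, so ordering is preserved when the socket becomes writable again.
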
Regarding EDEADLK, I couldn't find a solid reason to retry it. It was
added in 72c6537.
The only notable library I saw doing the same is this one:
https://github.com/docker/docker-py/blob/master/docker/utils/socket.py#L28
I am not sure about it, but to avoid changing the client's behavior in some edge cases,
I am adding it to the retryable errors.