
write starvation fixes #545

Closed

wjps wants to merge 9 commits into pika:master from wjps:master

Conversation

@wjps (Contributor) commented Apr 12, 2015

Hi, this contains some changes to the way writes are sent in pika that address some issues we were seeing when using the tornado adapter. There's more detail in the individual commit messages, but the basic premise is that we should get the data out on the wire as quickly as possible. Unfortunately the use of sendall() and timeouts isn't really suited to that way of working, so the socket has been changed to non-blocking (after the connect sequence) and we simply handle the write errors when the socket buffer is full.

The changes seem to be generally applicable to other connection adapters and in fact there seems to be a bug with BlockingConnections on master that is fixed by applying these changes.

I've load tested the various adapters and in general I see much more consistent and reliable behaviour now (no random timeouts, pipeline stalls, etc.).

Cheers,

Will

Will Slater added 4 commits April 11, 2015 18:42
By default (in base_connection) Pika buffers all writes and only
attempts to send the next time it drops into the ioloop and detects
the socket as writable. This causes some odd behaviour in a number
of cases.
1. A process generating large numbers of messages will not actually
send them until it finishes the processing and drops into the ioloop.
2. A process that is consuming a large queue will only send messages
when its read buffer is empty. If the messages are small, this means
it may end up consuming thousands of messages for every one it manages to
publish. This behaviour stalls pipelines of processes.

This patch tries to send the data on the socket as soon as it's
generated and, to avoid the timeouts that might be caused by the use
of sendall(), it uses send() and handles partial writes by requeuing
the data.
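
Concretely, the send-and-requeue approach this commit message describes amounts to something like the sketch below. This is an illustration of the general technique only, not the actual patch: the flush_outbound name, the deque-based outbound_buffer, and the EWOULDBLOCK handling shown here are assumptions.

import errno
import socket
from collections import deque

# Minimal sketch, not pika's code: flush an outbound buffer on a
# non-blocking socket using send(), requeuing anything that could
# not be written so the ioloop can retry when the socket is writable.
def flush_outbound(sock, outbound_buffer):
    while outbound_buffer:
        frame = outbound_buffer.popleft()
        try:
            sent = sock.send(frame)
        except socket.error as exc:
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                # Socket buffer is full; requeue and wait for writability.
                outbound_buffer.appendleft(frame)
                return
            raise
        if sent < len(frame):
            # Partial write: requeue the unsent tail and stop for now.
            outbound_buffer.appendleft(frame[sent:])
            return

# Example usage (assumes sock is already connected):
#   sock.setblocking(0)
#   buf = deque([b"frame-1", b"frame-2"])
#   flush_outbound(sock, buf)
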
Move the _handle_write changes from tornado_connection to
base_connection. They're equally applicable to all connections
and this fixes a bug in master where BlockingConnections wouldn't
send messages until close was called.
@coveralls

Coverage decreased (-0.15%) to 72.47% when pulling 576c1f0 on wjps:master into 409670b on pika:master.

bytes_written += bw

if frame:
    LOGGER.warning("Partial write, requeing remaining data")
Member

@wjps, a partial write in this scenario is not abnormal at all, so let's not log it as a warning. debug level would be more appropriate, if it's necessary to log it at all.

Contributor Author

Yup, fair enough.

@vitaly-krugl (Member)

If I am not mistaken, the proposed change in base_connection.py might fix #538 as well.

except socket.timeout:
    # Will only come here if the socket is blocking
    LOGGER.warning("socket timeout, requeuing frame")
    self.outbound_buffer.appendleft(frame)
Member

Previously, socket.timeout was re-raised in both _handle_read() and _handle_write(). This PR makes them inconsistent: re-raised in _handle_read(), but suppressed in _handle_write().

Member

If the socket.timeout exception is going to be suppressed in _handle_write(), that necessitates clean-up in BlockingConnection._flush_outbound(), which still expects to handle socket.timeout from BaseConnection._handle_write(); its handling of socket.timeout becomes dead code with this change.

Then there are also the BlockingConnection._handle_timeout() and BlockingConnection._socket_timeouts logic bits that need to be considered. I think some of that logic becomes inconsistent as a result of this PR.

Finally, this changes the semantics of pika.connection.Properties.socket_timeout; at least as far as BlockingConnection is concerned.
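
To make the dead-code concern above concrete, here is a much-simplified illustration of the caller/callee relationship being discussed. These are paraphrased stand-ins, not pika's actual _handle_write()/_flush_outbound() implementations.

import socket

# Rough illustration only; function names and bodies are assumptions.
def handle_write(sock, outbound_buffer):
    frame = outbound_buffer.popleft()
    try:
        sent = sock.send(frame)
        if sent < len(frame):
            outbound_buffer.appendleft(frame[sent:])
    except socket.timeout:
        # Suppressed here, as the proposed patch does.
        outbound_buffer.appendleft(frame)

def flush_outbound(sock, outbound_buffer):
    if not outbound_buffer:
        return
    try:
        handle_write(sock, outbound_buffer)
    except socket.timeout:
        # Never reached once handle_write() swallows the exception:
        # this is the "dead code" referred to in the comment above.
        pass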

Member

Personally, I like the proposed change in _handle_write(), but I also wonder what the pika author's intention was behind pika.connection.Properties.socket_timeout and the corresponding handling and re-raising of socket.timeout exceptions in the original read and write paths. Was pika's author perhaps concerned about stalled connections, relying on socket.timeout exceptions to detect stalls and abort stalled connections? @gmr, would you mind chiming in on this? Thx!

@coveralls

Coverage decreased (-3.47%) to 69.15% when pulling f2dc430 on wjps:master into 409670b on pika:master.

@wjps (Contributor Author) commented Apr 13, 2015

Yup, it will fix #538 and various other intermittent failures that people who publish large volumes of messages will see. The existing code was basically giving up on a socket.timeout on write, when that is actually very likely if you're generating messages faster than the network can send them.
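
As an aside, the failure mode described here can be reproduced outside pika with a few lines of plain socket code. This is illustrative only; the host, port, timeout, and payload size are arbitrary choices, not anything from the PR.

import socket

# Toy reproduction, not pika code: a blocking socket with a short timeout
# raises socket.timeout from sendall() once the peer stops draining the
# kernel send buffer, i.e. when we produce faster than the network sends.
def publish_until_timeout(host, port, payload=b"x" * 65536):
    sock = socket.create_connection((host, port))
    sock.settimeout(0.25)
    try:
        while True:
            sock.sendall(payload)  # blocks, then times out when the buffer stays full
    except socket.timeout:
        print("sendall() timed out: producing faster than the network can drain")
    finally:
        sock.close()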

@gmr (Member) commented Apr 13, 2015

I'll take a look at this ASAP.

remove unused import, dodgy error handling
@coveralls

Coverage decreased (-0.52%) to 72.1% when pulling 7b474bb on wjps:master into 409670b on pika:master.


@coveralls

Coverage decreased (-0.08%) to 72.54% when pulling 7d457f5 on wjps:master into 409670b on pika:master.


@coveralls

Coverage decreased (-0.13%) to 72.49% when pulling 635957f on wjps:master into 409670b on pika:master.

@gmr (Member) commented Apr 29, 2015

There are quite a few changes in this PR. Please rebase down to a single commit, and reopen in a new PR referencing this one.

@gmr closed this Apr 29, 2015