A fix for the write starvation problem that we see with tornado and pika #556

Closed
wants to merge 11 commits

Conversation

@wjps (Contributor) commented May 1, 2015

This is #545 rebased to master.

As noted in the original PR, this is primarily to fix a problem we see
with the tornado adapter, but it seems to be applicable to all adapters,
so the patch addresses the issue in base_connection.

By default (in base_connection) Pika buffers all writes and only
attempts to send the next time it drops into the ioloop and detects
the socket as writable. This causes some odd behaviour in a number
of cases.

  1. A process generating large numbers of messages will not actually
    send them until it finishes processing and drops into the ioloop.
  2. A process that is consuming a large queue will only send messages
    when its read buffer is empty. If the messages are small, this means
    it may end up consuming thousands of messages for every one it manages
    to publish. This behaviour stalls pipelines of processes.

SelectConnection had previously overridden _flush_outbound to force
a drop into the ioloop with a write_only flag set. This fixed
the behaviour for this connection type but left all the others broken.

This patch tries to send the data on the socket as soon as it's
generated and handles failed sends due to the socket buffer being full
by re-queueing the data.
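
For illustration, here is a minimal Python sketch of the approach described above, assuming a connected non-blocking socket; the class and attribute names are placeholders rather than pika's actual internals:

    import errno
    import socket
    from collections import deque

    class ImmediateWriter(object):
        """Sketch only: send data as soon as it is produced and re-queue
        anything the socket buffer cannot accept right now."""

        def __init__(self, sock):
            self._sock = sock         # a connected, non-blocking socket
            self._outbound = deque()  # chunks waiting to be written

        def send(self, data):
            self._outbound.append(data)
            self._flush()

        def _flush(self):
            while self._outbound:
                chunk = self._outbound.popleft()
                try:
                    sent = self._sock.send(chunk)
                except socket.error as exc:
                    if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                        # Socket buffer full: put the chunk back and let the
                        # ioloop retry once the socket is writable again.
                        self._outbound.appendleft(chunk)
                        return
                    raise
                if sent < len(chunk):
                    # Partial write: re-queue the unsent tail.
                    self._outbound.appendleft(chunk[sent:])
                    return

Whatever remains queued can then be flushed when the ioloop next reports the socket as writable, matching the fallback behaviour the description implies.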

Vitaly Kruglikov and others added 3 commits March 29, 2015 12:43
Rework all the SelectConnection pollers to behave like a
standard ioloop and deal with multiple file descriptors.

Also fix the timeout handling code so that timeouts fire when
they are scheduled to do so rather than on a periodic timer.

Add an interrupt socketpair so that a second thread can interrupt
the ioloop if required (to make it exit). That's not to say that
the use of threads is heavily tested or recommended.

This work is required to make a fully non-blocking connect possible
in pika. It will also allow connections to multiple RabbitMQ servers
to be handled in a single ioloop. The SelectConnection ioloop can
now also be used by other code that wants to deal with non-pika
sockets using a single generic ioloop.
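
As background, the interrupt socketpair mentioned above is a common way to wake a blocking poller from another thread. The following is a hedged sketch using a Unix socketpair, with illustrative names rather than pika's actual API:

    import select
    import socket

    class WakeableLoop(object):
        """Illustrative only: one end of a socketpair is watched by the
        poller; another thread writes a byte to make the poll call return."""

        def __init__(self):
            self._r, self._w = socket.socketpair()
            self._r.setblocking(False)
            self._running = True

        def stop(self):
            # Safe to call from a second thread: the write wakes select().
            self._running = False
            self._w.send(b'x')

        def run(self, fds):
            while self._running:
                readable, _, _ = select.select(list(fds) + [self._r], [], [])
                if self._r in readable:
                    self._r.recv(4096)  # drain the wake-up byte(s)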
@wjps (Contributor, Author) commented May 1, 2015

Hmm, not sure what's going on there. The tests run fine here; I'll take a look.

@wjps (Contributor, Author) commented May 2, 2015

@gmr, don't know if you have any ideas? It seems to have failed on py2.6, but I've run the tests on Python 2.6 over and over and cannot reproduce it. I've also now set up Travis on my own repo, and it passed with the same code!?

I suspect it's some timing-related issue with the test and may have been triggered by the removal of _flush_outbound in SelectConnection, but, not being able to reproduce it, it's hard to say either way. I can kill this PR and re-push, but I'd rather not spam you with PRs either.

@vitaly-krugl (Member)
@wjps, it does sound like a race condition. Can you try to reproduce it by setting up a long-running test loop and letting it run for a day in the failing environment?

Looking at the failed test's log, something is clearly out of whack. Either something is affecting the processing or ordering of synchronous and other commands, or something external deleted the queue.

queue q53882832 declaration appears to succeed at first:

pika.callback: DEBUG: Removing callback #0: {'callback': <bound method TestZ_PublishAndConsume.on_queue_declared of <select_adapter_tests.TestZ_PublishAndConsume testMethod=start_test>>, 'only': None, 'one_shot': True, 'arguments': {'queue': 'q53882832'}, 'calls': 0}

But then, basic_consume returns NOT_FOUND - no queue 'q53882832':

pika.channel: INFO: <METHOD(['channel_number=1', 'frame_type=1', 'method=<Channel.Close([\'class_id=60\', \'method_id=20\', \'reply_code=404\', "reply_text=NOT_FOUND - no queue \'q53882832\' in vhost \'/\'"])>'])>
pika.channel: WARNING: Received remote Channel.Close (404): NOT_FOUND - no queue 'q53882832' in vhost '/'

…ection with broker and implement acceptance tests for those cases.

Fixed typo fwd.close() with fwd.stop()

Use array.array instead of bytearray to work around a bug in python 2.6: http://bugs.python.org/issue7827

Handle ECONNRESET in ForwardServer

Shorten test docstrings

Removed unused local variable; added docstring in ForwardServer.running property getter

Minor comment cleanup
@wjps (Contributor, Author) commented May 3, 2015

OK, there's definitely some sort of race in there; I can reproduce the failure on both master and my PR, for both TestZ_PublishAndConsume and TestZ_PublishAndConsumeBig (more easily on the latter).

Sometimes it takes several thousand iterations to do so, though; loading the machine up does appear to make it easier to trigger.
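
The long-running loop suggested earlier can be as simple as the sketch below; the test runner command and test path are assumptions based on the module name in the log, not a documented invocation:

    import subprocess

    attempt = 0
    while True:
        attempt += 1
        # Re-run the flaky acceptance test until it fails; adjust the runner
        # and test path to match the local setup.
        rc = subprocess.call(
            ['nosetests', 'select_adapter_tests:TestZ_PublishAndConsumeBig'])
        if rc != 0:
            print('failed on attempt %d' % attempt)
            break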

@wjps (Contributor, Author) commented May 4, 2015

OK, the problem was with the tests: they were passing a value in seconds rather than milliseconds as the x-expires argument to queue.declare, so sometimes the queue was deleted before the rest of the test could run. Fixed in #558.
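
For reference, x-expires is expressed in milliseconds; a hedged illustration (the channel object and queue name are placeholders, and queue_declare's exact signature varies between pika versions):

    # 30 seconds of idle time before the broker deletes the queue:
    channel.queue_declare(queue='test-queue',
                          arguments={'x-expires': 30 * 1000})  # ms, not seconds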

wjps and others added 7 commits May 4, 2015 12:16
Fix incorrect x-expires argument in acceptance tests
Get BlockingConnection into consistent state upon loss of TCP/IP connection with broker + acceptance tests
Make SelectConnection behave like an ioloop
Remove unused self.fd attribute from BaseConnection
…ite-starvation-fixes

Conflicts:
	pika/adapters/select_connection.py
@vitaly-krugl (Member)
Looks good to me

@wjps (Contributor, Author) commented May 18, 2015

@gmr, any thoughts on merging this?

@gmr (Member) commented May 18, 2015

There are a lot of commits in this and I'd rather not review the whole chain; any chance of a rebase flattening the PR?

@wjps (Contributor, Author) commented May 18, 2015

sure np.


@wjps (Contributor, Author) commented May 18, 2015

sent a new PR #578

@wjps closed this May 18, 2015