-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A fix the write starvation problem that we see with tornado and pika #556
Conversation
Rework all the SelectConnection pollers to behave like a standard ioloop and deal with multiple filedescriptors. Also fix the timeout handling code so that timeouts fire when they are scheduled to do so rather than on a periodic timer. Add an interrupt socketpair so that a second thread can interrupt the ioloop if required (to make it exit). That's not to say it's the use of threads is heavily tested or recommended. This work is required to make full non-blocking connect possible in pika. It will also allow connections to multiple rabbitmq servers to be handled in a single ioloop. The SelectConnection ioloop can now also be used by other code that wants to deal with non pika sockets using the a single generic ioloop.
By default (in base_connection) Pika buffers all writes and only attempts to send the next time it drops into the ioloop and detects the socket as writable. This causes some odd behaviour in a number of cases. 1. A process generating large numbers of messages will not actually send them until it finishes the processing and drops into the ioloop. 2. A process that is consuming a large queue will only send messages when it's read buffer is empty. If the messages are small this means it may end up consuming 000s of messages for every one it manages to publish. This behaviour stalls pipelines of processes. SelectConnection had previously overridden _flush_outbound to force a drop into the ioloop with a write_only flag set. This fixed the behaviour for this connection type but left all the others broken. This patch tries to send the data on the socket as soon as it's generated and handles failed sends due to the socket buffer being full by requeing the data.
Hmm, not sure what's going on there. The tests run fine here, will take a look. |
@gmr, don't know if you have any ideas? It seems to have failed on py2.6 but I've run the tests on python2.6 over and over and cannot reproduce, I've also now set-up travis on my own repo and that's passed the same code!? I suspect it's some timing related issue with the test and may have been triggered by the removal of the _flush_outbound in SelectConnection but, not being able to reproduce, it's hard to say either way. I can kill this PR and re-push but I'd rather not spam you with PRs either. |
@wjps, it does sound like a race condition. Can you try to reproduce it by setting up a long-running test loop and letting it run for a day in the failing environment? Looking at the failed test's log, something is clearly out of whack. Either something is impacting the processing/order of synchronous or other commands or something external deleted the queue. queue q53882832 declaration appears to succeed at first:
But then, basic_consume returns
|
…ection with broker and implement acceptance tests for those cases. Fixed typo fwd.close() with fwd.stop() Use array.array instead of bytearray to work around a bug in python 2.6: http://bugs.python.org/issue7827 Handle ECONNRESET in ForwardServer Shorten test docstrings Removed unused local variable; added docstring in ForwardServer.running property getter Minor comment cleanup
ok there's definitely some sort of race in there, I can reproduce the failure on both master and my PR for both TestZ_PublishAndConsume and TestZ_PublishAndConsumeBig (more easily on the latter). Sometimes it seems to take several thousand iterations to do so though, loading the machine up does appear to make it more easy to trigger.. |
Ok the problem was with the tests, they were using a value in seconds rather than ms as the x-expires argument to the queue.declare. Sometimes the queue was getting deleted before the rest of the test could run. Fixed in #558 |
Fix incorrect x-expires argument in acceptance tests
Get BlockingConnection into consistent state upon loss of TCP/IP connection with broker + acceptance tests
Make SelectConnection behave like an ioloop
Remove unused self.fd attribute from BaseConnection
By default (in base_connection) Pika buffers all writes and only attempts to send the next time it drops into the ioloop and detects the socket as writable. This causes some odd behaviour in a number of cases. 1. A process generating large numbers of messages will not actually send them until it finishes the processing and drops into the ioloop. 2. A process that is consuming a large queue will only send messages when it's read buffer is empty. If the messages are small this means it may end up consuming 000s of messages for every one it manages to publish. This behaviour stalls pipelines of processes. SelectConnection had previously overridden _flush_outbound to force a drop into the ioloop with a write_only flag set. This fixed the behaviour for this connection type but left all the others broken. This patch tries to send the data on the socket as soon as it's generated and handles failed sends due to the socket buffer being full by requeing the data.
…ite-starvation-fixes Conflicts: pika/adapters/select_connection.py
Looks good to me |
@gmr, any thoughts on merging this? |
There are a lot of commits in this and I'd rather not review the whole chain, any chance of a rebase flattening the PR? |
sure np. On Mon, 18 May 2015 at 17:38 Gavin M. Roy notifications@github.com wrote:
|
sent a new PR #578 |
This is #545 rebased to master.
As noted in the original PR this is primarily to fix a problem we see
with the tornado adapter but seems to be applicable to all adapters
so the patch addresses the issue in base_connection.
By default (in base_connection) Pika buffers all writes and only
attempts to send the next time it drops into the ioloop and detects
the socket as writable. This causes some odd behaviour in a number
of cases.
send them until it finishes the processing and drops into the ioloop.
when it's read buffer is empty. If the messages are small this means
it may end up consuming 000s of messages for every one it manages to
publish. This behaviour stalls pipelines of processes.
SelectConnection had previously overridden _flush_outbound to force
a drop into the ioloop with a write_only flag set. This fixed
the behaviour for this connection type but left all the others broken.
This patch tries to send the data on the socket as soon as it's
generated and handles failed sends due to the socket buffer being full
by requeing the data.