A fix for the write starvation problem that we see with tornado and pika #556

Closed
wants to merge 11 commits

Conversation

@wjps (Contributor) commented May 1, 2015

This is #545 rebased to master.

As noted in the original PR, this is primarily to fix a problem we see
with the tornado adapter, but it seems to be applicable to all adapters,
so the patch addresses the issue in base_connection.

By default (in base_connection) Pika buffers all writes and only
attempts to send the next time it drops into the ioloop and detects
the socket as writable. This causes some odd behaviour in a number
of cases.

  1. A process generating large numbers of messages will not actually
    send them until it finishes processing and drops into the ioloop.
  2. A process that is consuming a large queue will only send messages
    when its read buffer is empty. If the messages are small, this means
    it may end up consuming thousands of messages for every one it manages
    to publish. This behaviour stalls pipelines of processes.

SelectConnection had previously overridden _flush_outbound to force
a drop into the ioloop with a write_only flag set. This fixed
the behaviour for this connection type but left all the others broken.

This patch tries to send the data on the socket as soon as it's
generated and handles failed sends due to the socket buffer being full
by re-queueing the data.
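
For illustration, here is a minimal Python sketch of the approach described above, assuming a connected non-blocking socket; the class and attribute names are placeholders rather than pika's actual internals:

    import errno
    import socket
    from collections import deque

    class ImmediateWriter(object):
        """Sketch only: send data as soon as it is produced and re-queue
        anything the socket buffer cannot accept right now."""

        def __init__(self, sock):
            self._sock = sock         # a connected, non-blocking socket
            self._outbound = deque()  # chunks waiting to be written

        def send(self, data):
            self._outbound.append(data)
            self._flush()

        def _flush(self):
            while self._outbound:
                chunk = self._outbound.popleft()
                try:
                    sent = self._sock.send(chunk)
                except socket.error as exc:
                    if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                        # Socket buffer full: put the chunk back and let the
                        # ioloop retry once the socket is writable again.
                        self._outbound.appendleft(chunk)
                        return
                    raise
                if sent < len(chunk):
                    # Partial write: re-queue the unsent tail.
                    self._outbound.appendleft(chunk[sent:])
                    return

Whatever remains queued can then be flushed when the ioloop next reports the socket as writable, matching the fallback behaviour the description implies.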

Vitaly Kruglikov and others added 3 commits March 29, 2015 12:43
Rework all the SelectConnection pollers to behave like a
standard ioloop and deal with multiple file descriptors.

Also fix the timeout handling code so that timeouts fire when
they are scheduled to do so rather than on a periodic timer.

Add an interrupt socketpair so that a second thread can interrupt
the ioloop if required (to make it exit). That's not to say that
the use of threads is heavily tested or recommended.

This work is required to make a fully non-blocking connect possible
in pika. It will also allow connections to multiple RabbitMQ servers
to be handled in a single ioloop. The SelectConnection ioloop can
now also be used by other code that wants to deal with non-pika
sockets using a single generic ioloop.
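
As background, the interrupt socketpair mentioned above is a common way to wake a blocking poller from another thread. The following is a hedged sketch using a Unix socketpair, with illustrative names rather than pika's actual API:

    import select
    import socket

    class WakeableLoop(object):
        """Illustrative only: one end of a socketpair is watched by the
        poller; another thread writes a byte to make the poll call return."""

        def __init__(self):
            self._r, self._w = socket.socketpair()
            self._r.setblocking(False)
            self._running = True

        def stop(self):
            # Safe to call from a second thread: the write wakes select().
            self._running = False
            self._w.send(b'x')

        def run(self, fds):
            while self._running:
                readable, _, _ = select.select(list(fds) + [self._r], [], [])
                if self._r in readable:
                    self._r.recv(4096)  # drain the wake-up byte(s)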
@wjps (Contributor, Author) commented May 1, 2015

Hmm, not sure what's going on there. The tests run fine here; I'll take a look.

@wjps (Contributor, Author) commented May 2, 2015

@gmr, don't know if you have any ideas? It seems to have failed on py2.6, but I've run the tests on Python 2.6 over and over and cannot reproduce it. I've also now set up Travis on my own repo, and it passed with the same code!?

I suspect it's some timing-related issue with the test and may have been triggered by the removal of _flush_outbound in SelectConnection, but, not being able to reproduce it, it's hard to say either way. I can kill this PR and re-push, but I'd rather not spam you with PRs either.

@vitaly-krugl (Member)
@wjps, it does sound like a race condition. Can you try to reproduce it by setting up a long-running test loop and letting it run for a day in the failing environment?

Looking at the failed test's log, something is clearly out of whack. Either something is affecting the processing or ordering of synchronous and other commands, or something external deleted the queue.

queue q53882832 declaration appears to succeed at first:

pika.callback: DEBUG: Removing callback #0: {'callback': <bound method TestZ_PublishAndConsume.on_queue_declared of <select_adapter_tests.TestZ_PublishAndConsume testMethod=start_test>>, 'only': None, 'one_shot': True, 'arguments': {'queue': 'q53882832'}, 'calls': 0}

But then, basic_consume returns NOT_FOUND - no queue 'q53882832':

pika.channel: INFO: <METHOD(['channel_number=1', 'frame_type=1', 'method=<Channel.Close([\'class_id=60\', \'method_id=20\', \'reply_code=404\', "reply_text=NOT_FOUND - no queue \'q53882832\' in vhost \'/\'"])>'])>
pika.channel: WARNING: Received remote Channel.Close (404): NOT_FOUND - no queue 'q53882832' in vhost '/'

…ection with broker and implement acceptance tests for those cases.

Fixed typo fwd.close() with fwd.stop()

Use array.array instead of bytearray to work around a bug in python 2.6: http://bugs.python.org/issue7827

Handle ECONNRESET in ForwardServer

Shorten test docstrings

Removed unused local variable; added docstring in ForwardServer.running property getter

Minor comment cleanup
@wjps (Contributor, Author) commented May 3, 2015

OK, there's definitely some sort of race in there; I can reproduce the failure on both master and my PR, for both TestZ_PublishAndConsume and TestZ_PublishAndConsumeBig (more easily on the latter).

Sometimes it takes several thousand iterations to do so, though; loading the machine up does appear to make it easier to trigger.
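
The long-running loop suggested earlier can be as simple as the sketch below; the test runner command and test path are assumptions based on the module name in the log, not a documented invocation:

    import subprocess

    attempt = 0
    while True:
        attempt += 1
        # Re-run the flaky acceptance test until it fails; adjust the runner
        # and test path to match the local setup.
        rc = subprocess.call(
            ['nosetests', 'select_adapter_tests:TestZ_PublishAndConsumeBig'])
        if rc != 0:
            print('failed on attempt %d' % attempt)
            break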

@wjps (Contributor, Author) commented May 4, 2015

OK, the problem was with the tests: they were passing a value in seconds rather than milliseconds as the x-expires argument to queue.declare, so sometimes the queue was deleted before the rest of the test could run. Fixed in #558.
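
For reference, x-expires is expressed in milliseconds; a hedged illustration (the channel object and queue name are placeholders, and queue_declare's exact signature varies between pika versions):

    # 30 seconds of idle time before the broker deletes the queue:
    channel.queue_declare(queue='test-queue',
                          arguments={'x-expires': 30 * 1000})  # ms, not seconds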

wjps and others added 7 commits May 4, 2015 12:16
Fix incorrect x-expires argument in acceptance tests
Get BlockingConnection into consistent state upon loss of TCP/IP connection with broker + acceptance tests
Make SelectConnection behave like an ioloop
Remove unused self.fd attribute from BaseConnection
…ite-starvation-fixes

Conflicts:
	pika/adapters/select_connection.py
@vitaly-krugl (Member)
Looks good to me

@wjps (Contributor, Author) commented May 18, 2015

@gmr, any thoughts on merging this?

@gmr (Member) commented May 18, 2015

There are a lot of commits in this and I'd rather not review the whole chain; any chance of a rebase flattening the PR?

@wjps (Contributor, Author) commented May 18, 2015

sure np.


@wjps (Contributor, Author) commented May 18, 2015

sent a new PR #578

@wjps closed this May 18, 2015