Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manage stomp heartbeat in a separate thread. #65

Closed
wants to merge 1 commit into from
Closed

Conversation

ralphbean
Copy link
Member

Hopefully, this will resolve all kinds of issues we've been seeing with stomp
heartbeat management. With the call to callLater, the same thread is used
but unfortunately the call is often made after the interval elapses, even
when the queue is not that busy.

By putting heartbeat events in a separate thread, hopefully the hub can keep up
with the promise it made to the broker.

Hopefully, this will resolve all kinds of issues we've been seeing with stomp
heartbeat management.  With the call to `callLater`, the same thread is used
but unfortunately the call is often made *after* the `interval` elapses, even
when the queue is not that busy.

By putting heartbeat events in a separate thread, hopefully the hub can keep up
with the promise it made to the broker.
@mprahl
Copy link
Contributor

mprahl commented Sep 20, 2018

👍 makes sense

@ralphbean
Copy link
Member Author

Will wait to merge until I get some feedback from Wai and Luiz. They're testing it out for real in the dev instances of their apps.

@wcheang
Copy link

wcheang commented Sep 20, 2018

Ran a few rounds of tests sending a bunch of messages to the dev host with the patch, it didn't seem to miss any messages and I don't see any logs about reconnect. We had problems with stale clients being created last time before we turned off stomp heartbeat though, so I'd let it run for at least overnight to confirm it's not a problem.

@@ -155,12 +156,13 @@ def failover(self):

def start_heartbeat(self, interval):
self._heartbeat_enabled = True
reactor.callLater(interval / 1000.0, self.heartbeat, interval)
reactor.callInThread(self.heartbeat, interval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a glance, this doesn't look right. Nothing in self.heartbeat blocks, so deferring it to a thread is not only unnecessary, it's unsafe because self.proto is (presumably) a Twisted protocol which you should never call outside the reactor thread. If you're finding that callLater isn't called around when you want it to, it's because something is running blocking code in the reactor thread which is when you should use this API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the first point: see that a new (disappointing) time.sleep call is added in self.heartbeat.

On the second point: I have the same worry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you can't rely on interacting with the protocol object outside the twisted thread, it's not thread-safe. You could call back in with deferFromThread (or whatever it's called), but you're going to run into the same scheduling issue you're trying to avoid. Whatever is resulting in blocking calls (the consumers?) needs to run it its own thread.

@ralphbean
Copy link
Member Author

ralphbean commented Sep 24, 2018

OK - I think @jeremycline's observations mean this PR is a dead-end.

We do already have consumers running in their own threads, but only if the hub is not running in blocking mode... and in that case, the hub will ack messages as soon as they arrive - not after they are processed.

We're going to need:

  • A patch in moksha that (when blocking mode is set to false) will defer ACKing until after the message is processed by its consumers. Today, when blocking mode is set to false, messages are acked as soon as they are added to the internal queue.
  • Durable queues in the new activemq upgrade.

After those two are complete, our preferred configuration for services should be:

  • Set blocking mode to false.
  • Set ack mode to client-individual.
  • Turn stomp_heartbeat on to some value (any value).

@mikebonnet
Copy link
Contributor

+1 I think the approach outlined by @ralphbean makes a lot of sense.

@ralphbean
Copy link
Member Author

Hm. @wcheang and @lcarva both report that this patch resolves an existing problem for them (in two different applications) and that it doesn't generate any other obvious side effects (like unsafe manipulation of that protocol object).

I'm going to not merge this, but I will include it temporarily in a release of python-moksha-hub in fedora. I'll let that sit in updates-testing indefinitely, and not plan to ship it to stable. It will be eventually superseded by a different solution (described above).

@ralphbean
Copy link
Member Author

@ralphbean ralphbean closed this Sep 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants