
tx/block service: replace internal pub-sub model for push-pull #469

Merged Aug 8, 2018

Conversation

@thecodefactory (Member)

No description provided.

@thecodefactory (Member, Author) commented Jul 8, 2018

Following up on a comment previously present in the code, this optimizes the transaction publisher socket by connecting only once instead of on each publication. Without this change we observably fail to deliver all transactions, as seen in regular gaps in the published sequence numbers. @jachiang

EDIT: While we could do something similar for the block service, I haven't personally witnessed the sequence gaps there.
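For context, the effect of a thread-local, connect-once publisher socket can be sketched with a toy model (Python used purely for illustration; `FakeSocket` and `publish` are hypothetical stand-ins, not libbitcoin or zmq API). Each thread pays the connection cost once on first use, instead of once per publication:

```python
import threading

# Hypothetical stand-in for a zmq publisher socket; counts how many
# times a "connection" is established across all threads.
connect_count = 0
count_lock = threading.Lock()

class FakeSocket:
    def __init__(self):
        global connect_count
        with count_lock:
            connect_count += 1  # one increment per connection made

    def send(self, message):
        pass  # a real socket would publish here

# One socket per thread, created on first use and then reused, which is
# the effect of a thread_local socket in C++.
tls = threading.local()

def publish(message):
    if not hasattr(tls, "socket"):
        tls.socket = FakeSocket()  # connect once per thread
    tls.socket.send(message)

threads = [threading.Thread(target=lambda: [publish(i) for i in range(100)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 4 threads x 100 publications, but only 4 connections were made.
print(connect_count)
```

This also illustrates evoskuil's later point: because each thread gets its own socket, a single shared member variable could not serve multiple worker threads.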

@coveralls commented Jul 8, 2018

Coverage Status

Coverage remained the same at ?% when pulling ff11d77 on thecodefactory:static-socket into c8eaaf0 on libbitcoin:version3.

@evoskuil (Member) left a comment:

The above is static, not thread static. Also, the PR describes this as an optimization, but the commit comments describe it as a bug fix. It's not clear to me that a bug has been identified.

@thecodefactory thecodefactory force-pushed the static-socket branch 2 times, most recently from 7c06a32 to 50bbdd0 Compare July 8, 2018 03:31
@thecodefactory (Member, Author)

Bug fix or optimization -- which it is depends on the advertised behavior, but publications happen either way.

@evoskuil (Member) commented Jul 8, 2018

Sorry, I don’t understand that comment. What is the bug and its cause?

@thecodefactory (Member, Author)

Currently, reconnecting at every publish does not fail, nor does it publish every transaction. Try to subscribe to any existing v3 transaction service and observe the discrepancy in what the server thinks it's publishing (via verbose logs) and what zmq is actually delivering (via any client). @jachiang observed this a while ago and I spent some time looking at the client side, until I realized there was nothing wrong on that side, except that messages were being dropped somehow.

The persistent socket connection also does not fail, and appears to better ensure message delivery. I've logged subscriptions from different clients (bx subscribe-tx and python subscribe-tx) over longer timeframes (several hours) with this change and haven't missed any yet.

@evoskuil (Member) commented Jul 8, 2018

My question is why txs are missed.

@evoskuil (Member) commented Jul 8, 2018

Also, I believe there is a cross-platform issue with your thread static implementation, as well as a cleanup issue. Working with statics is hard, which is why I deferred it. But if there is a bug, we need to know what it is in order to know it's fixed.

@thecodefactory (Member, Author)

> Also I believe there is a cross plat issue with your thread static implementation, as well as a cleanup issue. Working with statics is hard, which is why I deferred it. But if there is a bug we need to know what it is in order to know it's fixed.

In what way? It's C++11, so it should work anywhere. As for cleanup, there's no restart/resubscribe support, so it follows the existing cleanup semantics.

Why are txs missing? I'm not sure; it appears to be zmq-specific behaviour. I can only think that while the pub socket is not connected, there is no knowledge of any subscribers, and there may be some lag between each reconnect and subscriber awareness before the socket is torn down and brought back up. Since there are (almost?) never two sequential tx sequence numbers as it stands, it appears to be related to some lag around that connect.
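The suspected race can be illustrated with a toy model (Python for illustration only; `PubSocket` is a hypothetical stand-in, not zmq API). A pub-style socket silently drops messages until the subscription handshake has propagated, so reconnecting for every publish races that handshake each time, while a persistent connection races it only once:

```python
import random

class PubSocket:
    """Toy pub socket: a message is delivered only if the subscriber
    is already known; otherwise it is silently dropped, as a real zmq
    PUB socket does when it sees no matching subscription."""
    def __init__(self):
        self.subscriber_known = False
        self.delivered = []

    def connect(self):
        # Subscription state takes time to propagate after a connect;
        # model that lag as a coin flip.
        self.subscriber_known = random.random() < 0.5

    def send(self, seq):
        if self.subscriber_known:
            self.delivered.append(seq)

random.seed(7)

# Reconnect per publish: every send races the subscription handshake.
per_publish = PubSocket()
for seq in range(100):
    per_publish.connect()
    per_publish.send(seq)

# Persistent connection: the handshake completed once, long ago.
persistent = PubSocket()
persistent.subscriber_known = True
for seq in range(100):
    persistent.send(seq)

print(len(per_publish.delivered), len(persistent.delivered))
```

The per-publish socket shows exactly the symptom described above: sequence-number gaps with no send-side error, while the persistent socket delivers everything.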

@thecodefactory (Member, Author)

I think your earlier questions regarding bug/optimization are applicable here because I'm not sure there's a bug, per se. The code works, it just seems to not perform well (and deliver all messages). But given what the service is supposed to do, calling that a bug fix or an optimization depends on how you look at it.

As for how this is addressed, well, that's the point of this PR/review/discussion. I wouldn't have thought of using thread local (had it not been for the existing comment), and probably would have opted for a member/shared ptr, which could allow easier cleanup on shutdown.

On the other hand, maybe nothing needs to be done and we just document that it's a best effort publisher (since it really is anyway, given that it can drop under load).

@evoskuil (Member) commented Jul 8, 2018

If it's dropping messages without hitting the high water mark, it's a bug. It seems that if that is the case, the proposed solution implies a race condition. But I see no solution to a race condition in the changes.

The thread local comment dealt only with reducing the number of connections. This could not have been done with a member as there is one class and potentially multiple threads. Each thread must establish its own connection to the endpoint.

If it’s simply a question of performance that must be a consequence of hitting the high water mark earlier than expected. This can be verified by simply increasing the HWM in config.
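A minimal sketch of the HWM semantics under discussion, using a Python stdlib queue as a stand-in for zmq's internal per-connection queue (the HWM value here is illustrative, not libbitcoin's default):

```python
from queue import Queue, Full

HWM = 5  # illustrative high water mark

outbound = Queue(maxsize=HWM)  # bounded, like a zmq send queue
dropped = 0

# A slow subscriber: nothing is drained while we burst 20 messages.
for seq in range(20):
    try:
        outbound.put_nowait(seq)  # pub semantics: never block...
    except Full:
        dropped += 1              # ...drop once the HWM is reached

print(outbound.qsize(), dropped)  # 5 queued, 15 dropped
```

Raising the HWM widens the queue and delays the onset of dropping, which is why increasing it in config is a quick way to test whether drops are HWM-related.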

@evoskuil (Member) commented Jul 8, 2018

The reason we’ve used boost::thread is that there is an injectable thread destructor that can be used to clean up thread static state. But in this case the threads are created independently from the class that creates the connection objects. This makes it hard to inject a cleanup closure into the thread destructor. Also, I haven’t had a chance to review, but it may be the case that the cleanup must occur before the threads terminate. That would make cleanup even more interesting.

@thecodefactory (Member, Author)

@evoskuil You are correct that this change is not needed, though it does seem to help. I've been looking at this for a while and zmq provides no errors while monitoring, but since it doesn't allow inspection of internal queue state either, it's very hard to tell when dropping will happen. I've also read that drops can happen well before the actual HWM is reached, which doesn't help close inspection from the outside.

The issue appears to be an internal message loss (even with send/recv hwm set insanely high at 10K and even at 100K, both for testnet tx subscription) between the pub-sub from the worker to the bound subscriber that relays to the external facing pub socket. Note that in this case, even with large hwm, no zmq queues are established (in this case on the sending side) until the connection is made (i.e. during every tx publish), then 1 send is done before it's no longer used. In this case it may actually hurt more to setup large queues.

In any case, what seems to work reliably, though it's perhaps not desirable, is using the push/pull model internally (making the xsub a puller and the worker connector a pusher; PAIR cannot work since it requires exclusivity, but would otherwise be well suited). This would mean there is queuing at the internal worker endpoint, while outgoing external messages can still be dropped in error state to the real subscribers/clients.

I realize the internal pub/sub model was chosen because it allows drops and therefore can't die from OOM during a sustained tx broadcast storm, so it's unclear how to proceed.
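The tradeoff described above can be sketched as two send disciplines over a stand-in queue (Python for illustration; the function names are hypothetical): pub/sub sheds load by dropping when full, while push/pull queues everything, which avoids drops but risks unbounded memory growth during a sustained broadcast storm:

```python
from queue import Queue, Full

def pubsub_send(q, msg):
    """Pub/sub discipline: never block; drop when the queue is full."""
    try:
        q.put_nowait(msg)
        return True
    except Full:
        return False

def pushpull_send(q, msg):
    """Push/pull discipline: queue everything; nothing is dropped,
    but memory can grow without bound under a tx storm."""
    q.put(msg)
    return True

burst = range(1000)  # a burst with no consumer draining

ps = Queue(maxsize=10)  # small bound, pub/sub style
ps_delivered = sum(pubsub_send(ps, m) for m in burst)

pp = Queue()  # unbounded, push/pull style
pp_delivered = sum(pushpull_send(pp, m) for m in burst)

print(ps_delivered, pp_delivered)  # 10 vs 1000
```

This is the "unclear how to proceed" point: the pub/sub discipline is OOM-safe but lossy, the push/pull discipline is lossless but unbounded.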

I have verified that zmq sends are properly done on the server side (i.e. there are no hidden socket errors), but are dropped and not making it to the client (via wire inspection). It's possible that we're losing data both internally and externally too, but at least in my setup, I'm ruling out the external scenario because allowing queuing internally appears to 'fix' the external scenario.

If any of this doesn't make sense (other than the obvious mystery of why we're getting drops), let me know. Otherwise I'll update the PR (it's very simple, just changing the internal zmq socket types) and we can go from there. As-is, I think this PR is not needed, so we should either go the newly proposed route (and I'll update the PR), or close this while we discuss a different/safer alternative. It's also possible that nothing is needed at all, because everything does appear to be working as intended, just with more drops than expected. Perhaps there's a solution to that I'm not aware of, though I've been reading the zmq docs quite a bit ;-)

@evoskuil (Member) commented Aug 7, 2018

I would at some point like to maintain the connections on a per-thread basis, allowing internal queuing and reducing connection latency. However, it's good to know that the issue is message dropping by the xsub. I'm okay with converting that to push/pull, but at some point we should figure out what's going on with xsub. Maybe a query to the zmq peeps is in order.

@thecodefactory (Member, Author)

> Maybe a query to the zmq peeps is in order.

Agreed. I may also dig in there to see where it's happening, but it's unfortunate that they don't provide insight into what's going on via the API. It might be much more than I'd like to get into, but it may also provide easy clarity.

@evoskuil (Member) commented Aug 7, 2018

Opacity of the queue is a design decision intended to retain zmq control over the queue as private state.

@thecodefactory thecodefactory changed the title Optimize transaction publisher socket by connecting only once tx service: replace internal pub-sub model for push-pull Aug 8, 2018
@thecodefactory (Member, Author)

Will test this change for blocks too, although it may be less obvious whether it's an improvement there. Will update if it seems ok.

@thecodefactory thecodefactory changed the title tx service: replace internal pub-sub model for push-pull tx/block service: replace internal pub-sub model for push-pull Aug 8, 2018
@thecodefactory (Member, Author)

Updated.

@thecodefactory thecodefactory merged commit 9432c21 into libbitcoin:version3 Aug 8, 2018
@evoskuil (Member) left a comment:

Please update names for style and run down the question of proper use of xpub/xsub with zeromq.

@@ -73,7 +73,7 @@ bool block_service::start()
 void block_service::work()
 {
     zmq::socket xpub(authenticator_, role::extended_publisher, external_);
-    zmq::socket xsub(authenticator_, role::extended_subscriber, internal_);
+    zmq::socket xsub(authenticator_, role::puller, internal_);
@evoskuil (Member) commented on the diff:

This line should read:

zmq::socket puller(authenticator_, role::puller, internal_);

The same applies to all other changes in the PR.

@@ -73,7 +73,7 @@ bool block_service::start()
 void block_service::work()
 {
     zmq::socket xpub(authenticator_, role::extended_publisher, external_);
-    zmq::socket xsub(authenticator_, role::extended_subscriber, internal_);
+    zmq::socket xsub(authenticator_, role::puller, internal_);
 
     // Bind sockets to the service and worker endpoints.
     if (!started(bind(xpub, xsub)))
@evoskuil (Member) commented on the diff:

The reason for using XPUB is to proxy it against XSUB. I was not aware that XPUB could proxy against anything other than XSUB. I'm not confident we have this right. See ZeroMQ documentation.

@thecodefactory (Member, Author) replied:

Ah, I think you're right. zmq usually fails, or fails to deliver, on unsupported operations, so I'm not sure how they let this slide ;-) I'll ask about it and also update this.

@thecodefactory thecodefactory deleted the static-socket branch August 27, 2018 16:34