
tx/block service: replace internal pub-sub model for push-pull #469

Merged Aug 8, 2018

Conversation

@thecodefactory (Member)

No description provided.

@thecodefactory (Member, Author) commented Jul 8, 2018

Following up on a comment previously present in the code, this optimizes the transaction publisher socket by connecting only once instead of on each publication. Without this change we observably fail to deliver all transactions, as seen in regular gaps in the published sequence numbers. @jachiang

EDIT: While we could do something similar for the block service, I haven't personally witnessed the sequence gaps there.
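For context, the effect of a thread-local, connect-once publisher socket can be sketched with a toy model (Python used purely for illustration; `FakeSocket` and `publish` are hypothetical stand-ins, not libbitcoin or zmq API). Each thread pays the connection cost once on first use, instead of once per publication:

```python
import threading

# Hypothetical stand-in for a zmq publisher socket; counts how many
# times a "connection" is established across all threads.
connect_count = 0
count_lock = threading.Lock()

class FakeSocket:
    def __init__(self):
        global connect_count
        with count_lock:
            connect_count += 1  # one increment per connection made

    def send(self, message):
        pass  # a real socket would publish here

# One socket per thread, created on first use and then reused, which is
# the effect of a thread_local socket in C++.
tls = threading.local()

def publish(message):
    if not hasattr(tls, "socket"):
        tls.socket = FakeSocket()  # connect once per thread
    tls.socket.send(message)

threads = [threading.Thread(target=lambda: [publish(i) for i in range(100)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 4 threads x 100 publications, but only 4 connections were made.
print(connect_count)
```

This also illustrates evoskuil's later point: because each thread gets its own socket, a single shared member variable could not serve multiple worker threads.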

@coveralls commented Jul 8, 2018

Coverage Status

Coverage remained the same at ?% when pulling ff11d77 on thecodefactory:static-socket into c8eaaf0 on libbitcoin:version3.

@evoskuil (Member) left a comment:

The above is static, not thread static. Also, the PR describes this as an optimization, but the commit comments describe it as a bug fix. It's not clear to me that a bug has been identified.

@thecodefactory thecodefactory force-pushed the static-socket branch 2 times, most recently from 7c06a32 to 50bbdd0 Compare July 8, 2018 03:31
@thecodefactory (Member, Author)

Bug fix or optimization -- which it is depends on the advertised behavior, but publications happen either way.

@evoskuil (Member) commented Jul 8, 2018

Sorry, I don’t understand that comment. What is the bug and its cause?

@thecodefactory (Member, Author)

Currently, reconnecting at every publish does not fail, nor does it publish every transaction. Try to subscribe to any existing v3 transaction service and observe the discrepancy in what the server thinks it's publishing (via verbose logs) and what zmq is actually delivering (via any client). @jachiang observed this a while ago and I spent some time looking at the client side, until I realized there was nothing wrong on that side, except that messages were being dropped somehow.

The persistent socket connection also does not fail, and appears to better ensure message delivery. I've logged subscriptions from different clients (bx subscribe-tx and python subscribe-tx) over longer timeframes (several hours) with this change and haven't missed any yet.

@evoskuil (Member) commented Jul 8, 2018

My question is why txs are missed.

@evoskuil (Member) commented Jul 8, 2018

Also, I believe there is a cross-platform issue with your thread static implementation, as well as a cleanup issue. Working with statics is hard, which is why I deferred it. But if there is a bug, we need to know what it is in order to know it's fixed.

@thecodefactory (Member, Author)

> Also I believe there is a cross plat issue with your thread static implementation, as well as a cleanup issue. Working with statics is hard, which is why I deferred it. But if there is a bug we need to know what it is in order to know it's fixed.

In what way? It's C++11, so it should work anywhere. As for cleanup, there's no restart/resubscribe support, so it follows the existing cleanup semantics.

Why are txs missing? I'm not sure; it appears to be zmq-specific behaviour. I can only think that while the pub socket is not connected, there is no knowledge of any subscribers, and there may be some lag between each reconnect and subscriber awareness before the socket is torn down and brought back up. Since there are (almost?) never two sequential tx sequence numbers as it stands, it appears to be related to some lag around that connect.
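The suspected race can be illustrated with a toy model (Python for illustration only; `PubSocket` is a hypothetical stand-in, not zmq API). A pub-style socket silently drops messages until the subscription handshake has propagated, so reconnecting for every publish races that handshake each time, while a persistent connection races it only once:

```python
import random

class PubSocket:
    """Toy pub socket: a message is delivered only if the subscriber
    is already known; otherwise it is silently dropped, as a real zmq
    PUB socket does when it sees no matching subscription."""
    def __init__(self):
        self.subscriber_known = False
        self.delivered = []

    def connect(self):
        # Subscription state takes time to propagate after a connect;
        # model that lag as a coin flip.
        self.subscriber_known = random.random() < 0.5

    def send(self, seq):
        if self.subscriber_known:
            self.delivered.append(seq)

random.seed(7)

# Reconnect per publish: every send races the subscription handshake.
per_publish = PubSocket()
for seq in range(100):
    per_publish.connect()
    per_publish.send(seq)

# Persistent connection: the handshake completed once, long ago.
persistent = PubSocket()
persistent.subscriber_known = True
for seq in range(100):
    persistent.send(seq)

print(len(per_publish.delivered), len(persistent.delivered))
```

The per-publish socket shows exactly the symptom described above: sequence-number gaps with no send-side error, while the persistent socket delivers everything.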

@thecodefactory (Member, Author)

I think your earlier questions regarding bug/optimization are applicable here because I'm not sure there's a bug, per se. The code works, it just seems to not perform well (and deliver all messages). But given what the service is supposed to do, calling that a bug fix or an optimization depends on how you look at it.

As for how this is addressed, well, that's the point of this PR/review/discussion. I wouldn't have thought of using thread local (had it not been for the existing comment), and probably would have opted for a member/shared ptr, which could allow easier cleanup on shutdown.

On the other hand, maybe nothing needs to be done and we just document that it's a best effort publisher (since it really is anyway, given that it can drop under load).

@evoskuil (Member) commented Jul 8, 2018

If it's dropping messages without hitting the high water mark, it's a bug. It seems that if that is the case, the proposed solution implies a race condition. But I see no solution to a race condition in the changes.

The thread local comment dealt only with reducing the number of connections. This could not have been done with a member as there is one class and potentially multiple threads. Each thread must establish its own connection to the endpoint.

If it’s simply a question of performance that must be a consequence of hitting the high water mark earlier than expected. This can be verified by simply increasing the HWM in config.
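A minimal sketch of the HWM semantics under discussion, using a Python stdlib queue as a stand-in for zmq's internal per-connection queue (the HWM value here is illustrative, not libbitcoin's default):

```python
from queue import Queue, Full

HWM = 5  # illustrative high water mark

outbound = Queue(maxsize=HWM)  # bounded, like a zmq send queue
dropped = 0

# A slow subscriber: nothing is drained while we burst 20 messages.
for seq in range(20):
    try:
        outbound.put_nowait(seq)  # pub semantics: never block...
    except Full:
        dropped += 1              # ...drop once the HWM is reached

print(outbound.qsize(), dropped)  # 5 queued, 15 dropped
```

Raising the HWM widens the queue and delays the onset of dropping, which is why increasing it in config is a quick way to test whether drops are HWM-related.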

@evoskuil (Member) commented Jul 8, 2018

The reason we’ve used boost::thread is that there is an injectable thread destructor that can be used to clean up thread static state. But in this case the threads are created independently from the class that creates the connection objects. This makes it hard to inject a cleanup closure into the thread destructor. Also, I haven’t had a chance to review, but it may be the case that the cleanup must occur before the threads terminate. That would make cleanup even more interesting.

@thecodefactory (Member, Author)

@evoskuil You are correct that this change is not needed, though it does seem to help. I've been looking at this for a while and zmq provides no errors while monitoring, but since it doesn't allow inspection of internal queue state either, it's very hard to tell when dropping will happen. I've also read that drops can happen well before the actual HWM is reached, which doesn't help close inspection from the outside.

The issue appears to be an internal message loss (even with send/recv hwm set insanely high at 10K and even at 100K, both for testnet tx subscription) between the pub-sub from the worker to the bound subscriber that relays to the external facing pub socket. Note that in this case, even with large hwm, no zmq queues are established (in this case on the sending side) until the connection is made (i.e. during every tx publish), then 1 send is done before it's no longer used. In this case it may actually hurt more to setup large queues.

In any case, what seems to work reliably, though it's perhaps not desirable, is using the push/pull model internally (making the xsub a puller and the worker connector a pusher; PAIR cannot work since it requires exclusivity, but would otherwise be well suited). This would mean there is queuing at the internal worker endpoint, while outgoing external messages can still be dropped in error state to the real subscribers/clients.

I realize the internal pub/sub model was chosen because it allows drops and therefore can't die from OOM during a sustained tx broadcast storm, so it's unclear how to proceed.
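The tradeoff described above can be sketched as two send disciplines over a stand-in queue (Python for illustration; the function names are hypothetical): pub/sub sheds load by dropping when full, while push/pull queues everything, which avoids drops but risks unbounded memory growth during a sustained broadcast storm:

```python
from queue import Queue, Full

def pubsub_send(q, msg):
    """Pub/sub discipline: never block; drop when the queue is full."""
    try:
        q.put_nowait(msg)
        return True
    except Full:
        return False

def pushpull_send(q, msg):
    """Push/pull discipline: queue everything; nothing is dropped,
    but memory can grow without bound under a tx storm."""
    q.put(msg)
    return True

burst = range(1000)  # a burst with no consumer draining

ps = Queue(maxsize=10)  # small bound, pub/sub style
ps_delivered = sum(pubsub_send(ps, m) for m in burst)

pp = Queue()  # unbounded, push/pull style
pp_delivered = sum(pushpull_send(pp, m) for m in burst)

print(ps_delivered, pp_delivered)  # 10 vs 1000
```

This is the "unclear how to proceed" point: the pub/sub discipline is OOM-safe but lossy, the push/pull discipline is lossless but unbounded.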

I have verified that zmq sends are properly done on the server side (i.e. there are no hidden socket errors), but are dropped and not making it to the client (via wire inspection). It's possible that we're losing data both internally and externally too, but at least in my setup, I'm ruling out the external scenario because allowing queuing internally appears to 'fix' the external scenario.

If any of this doesn't make sense (other than the obvious mystery of why we're getting drops), let me know. Otherwise I'll update the PR (it's very simple, just changing the internal zmq socket types) and we can go from there. As-is, I think this PR is not needed, so we should either go the newly proposed route (and I'll update the PR), or close this while we discuss a different/safer alternative. It's also possible that nothing is needed at all, because everything does appear to be working as intended, just with more drops than expected. Perhaps there's a solution to that I'm not aware of, though I've been reading the zmq docs quite a bit ;-)

@evoskuil (Member) commented Aug 7, 2018

I would at some point like to maintain the connections on a per-thread basis, allowing internal queuing and reducing connection latency. However, it's good to know that the issue is message dropping by the xsub. I'm okay with converting that to push/pull, but at some point we should figure out what's going on with xsub. Maybe a query to the zmq peeps is in order.

@thecodefactory (Member, Author)

> Maybe a query to the zmq peeps is in order.

Agreed. I may also dig in there to see where it's happening, but it's unfortunate that they don't provide insight into what's going on via the API. It might be much more than I'd like to get into, but it may also provide easy clarity.

@evoskuil (Member) commented Aug 7, 2018

Opacity of the queue is a design decision intended to retain zmq control over the queue as private state.

@thecodefactory thecodefactory changed the title Optimize transaction publisher socket by connecting only once tx service: replace internal pub-sub model for push-pull Aug 8, 2018
@thecodefactory (Member, Author)

Will test this change for blocks too, although it may be less obvious whether it's an improvement there. Will update if it seems ok.

@thecodefactory thecodefactory changed the title tx service: replace internal pub-sub model for push-pull tx/block service: replace internal pub-sub model for push-pull Aug 8, 2018
@thecodefactory (Member, Author)

Updated.

@thecodefactory thecodefactory merged commit 9432c21 into libbitcoin:version3 Aug 8, 2018
@evoskuil (Member) left a comment:

Please update names for style and run down the question of proper use of xpub/xsub with zeromq.

@@ -73,7 +73,7 @@ bool block_service::start()
 void block_service::work()
 {
     zmq::socket xpub(authenticator_, role::extended_publisher, external_);
-    zmq::socket xsub(authenticator_, role::extended_subscriber, internal_);
+    zmq::socket xsub(authenticator_, role::puller, internal_);
@evoskuil (Member) commented on the diff:

This line should read:

zmq::socket puller(authenticator_, role::puller, internal_);

The same applies to all other changes in the PR.

@@ -73,7 +73,7 @@ bool block_service::start()
 void block_service::work()
 {
     zmq::socket xpub(authenticator_, role::extended_publisher, external_);
-    zmq::socket xsub(authenticator_, role::extended_subscriber, internal_);
+    zmq::socket xsub(authenticator_, role::puller, internal_);
 
     // Bind sockets to the service and worker endpoints.
     if (!started(bind(xpub, xsub)))
@evoskuil (Member) commented on the diff:

The reason for using XPUB is to proxy it against XSUB. I was not aware that XPUB could proxy against anything other than XSUB. I'm not confident we have this right. See ZeroMQ documentation.

@thecodefactory (Member, Author) replied:

Ah, I think you're right. zmq usually fails, or fails to deliver, on unsupported operations, so I'm not sure how they let this slide ;-) I'll ask about it and also update this.

@thecodefactory thecodefactory deleted the static-socket branch August 27, 2018 16:34