Conversation

betachen
Member

@betachen betachen commented Oct 9, 2016

The DNS name query may hit an unreachable host if we only query once.

@coveralls

Coverage Status

Coverage decreased (-0.005%) to 1.843% when pulling e159e7f on betachen:master into b5aebbb on libbitcoin:master.

@betachen
Member Author

@evoskuil

I got a compile error on macOS:

Makefile.am:3: error: Libtool library used but 'LIBTOOL' is undefined
Makefile.am:3: The usual way to define 'LIBTOOL' is to add 'LT_INIT'
Makefile.am:3: to 'configure.ac' and run 'aclocal' and 'autoconf' again.
Makefile.am:3: If 'LT_INIT' is in 'configure.ac', make sure
Makefile.am:3: its definition is in aclocal's search path.

@evoskuil
Member

evoskuil commented Oct 10, 2016

At some point our Travis OSX builds just started failing due to an insufficient libtool version, which I assume was caused by a platform change at Travis, since our builds hadn't changed materially. We updated the build generation to uninstall and install libtool on the OSX builds, which resolved the issue. Then a couple of days later this error started up.

I'm traveling and haven't had time to look at it. If you can successfully patch the configuration I will apply the update via 'libbitcoin-build' (or you can do so). 'LT_INIT' is configured, so presumably there is an issue with the search path.

Member

@evoskuil evoskuil left a comment

This is a great fix, sorry I didn't realize this was what you were referring to in your initial question on the subject. Thanks for submitting.


safe_connect(iterator, socket, timer, handle_connect);
auto do_connecting = [&](boost::asio::ip::tcp::resolver::iterator iter){
const auto timeout = settings_.connect_timeout();
Member

Style: use 4 spaces per indent (vs. tabs).

Member

Style: avoid abbreviations (iter), balance curly braces.

Member

Style: use existing type alias (asio::iterator).

Member Author

all fixed.

safe_connect(iter, socket, timer, handle_connect);
};

//get all hosts under one seed
Member

Generalize comments ("DNS record" vs. "seed"), since this affects all outbound network connections, not just seeding.

Style: use sentences (leading space, leading cap, period) in comments, e.g.:

// Get all hosts under one DNS record.

Member

It would also be nice if you squashed the two commits to clarify history.

Member Author

All fixed.

shared_from_this(), _1, socket, handle_connect));

safe_connect(iterator, socket, timer, handle_connect);
auto do_connecting = [&](boost::asio::ip::tcp::resolver::iterator iter){
Member

@evoskuil evoskuil Oct 10, 2016

Limit scope of closure by passing this in this case vs. &.

Member Author

Hi, already fixed, using "[this, &handler]".
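
For illustration, a minimal self-contained sketch (not the PR's code; connector, start, and handler are hypothetical names) of the narrower capture suggested above versus a blanket [&] capture:

#include <functional>
#include <iostream>

class connector
{
public:
    void connect(const std::function<void(int)>& handler)
    {
        // A blanket [&] capture pulls in every local by reference, wider than needed.
        // The narrow capture exposes only the members (via this) and the handler.
        const auto attempt = [this, &handler]() { start(handler); };
        attempt();
    }

private:
    void start(const std::function<void(int)>& handler)
    {
        handler(attempts_++);
    }

    int attempts_ = 0;
};

int main()
{
    connector instance;
    instance.connect([](int count) { std::cout << "attempt " << count << "\n"; });
}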

@coveralls

Coverage Status

Coverage decreased (-0.005%) to 1.843% when pulling e2b8cb3 on betachen:master into b5aebbb on libbitcoin:master.

@coveralls

Coverage Status

Coverage decreased (-0.005%) to 1.843% when pulling a9d7bb9 on betachen:master into b5aebbb on libbitcoin:master.

@coveralls

Coverage Status

Coverage decreased (-0.005%) to 1.843% when pulling f959ab9 on betachen:master into b5aebbb on libbitcoin:master.

@coveralls

Coverage Status

Coverage decreased (-0.005%) to 1.843% when pulling e8ecd1b on betachen:master into b5aebbb on libbitcoin:master.

@betachen
Member Author

Raising a new issue:

17:06:52.730210 INFO [network] Failure contacting seed [seed.bitnodes.io:8333] unable to reach remote host
17:06:52.730339 ERROR [network] Error seeding host addresses: operation failed
17:06:52.730460 ERROR [node] Node failed to start with error, operation failed.
17:06:52.730606 INFO [node] Please wait while the node is stopping...

I'm confused; however, it passed when I tested it yesterday.

By the way, why does libbitcoin have to start listening on 8333 after connecting to seeds?
I remember that these are in separate, unrelated threads in bitcoin-core.

// Get all hosts under one DNS record.
for (asio::iterator end; iterator != end; ++iterator)
{
do_connecting(iterator);
Member

This will continue to loop following the first connection, yet the completion handler will have already been called. This extends the thread execution unnecessarily after connect. I recommend converting safe_connect and this closure to bool and using success as a second loop termination condition.

@evoskuil
Member

I'm confused; however, it passed when I tested it yesterday.

Lack of network connectivity would do it.

By the way, why does libbitcoin have to start listening on 8333 after connecting to seeds?

It starts listening (if configured for inbound connections) following start (sync completion in node) so that it can receive inbound connections.

I remember that these are in separate, unrelated threads in bitcoin-core.

Libbitcoin's design isn't based on the satoshi client. The listen call is asynchronous; there would be no reason to dedicate a thread to its invocation.
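
For illustration, a minimal standalone sketch (generic boost::asio usage, not libbitcoin's acceptor) showing why no dedicated thread is needed; async_accept returns immediately and the handler runs later on a threadpool thread:

#include <boost/asio.hpp>
#include <iostream>

int main()
{
    using boost::asio::ip::tcp;

    boost::asio::io_service service;
    tcp::acceptor acceptor(service, tcp::endpoint(tcp::v4(), 8333));
    tcp::socket socket(service);

    // Returns immediately; the completion handler fires when a peer connects.
    acceptor.async_accept(socket,
        [](const boost::system::error_code& ec)
        {
            std::cout << (ec ? ec.message() : "inbound connection") << std::endl;
        });

    // Any thread running the service can execute the handler.
    service.run();
}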

@betachen
Member Author

betachen commented Oct 11, 2016

I have tested again.

  • Case: the first DNS seed is reachable => Error seeding host addresses: operation failed
  • Case: the first DNS seed cannot be reached => startup proceeds normally,
    which keeps io_service in the same period, so the other connection attempts return in time, right?

I've already fixed this issue with "std::promise" and "std::future"; is that OK?

@coveralls

Coverage Status

Coverage decreased (-0.008%) to 1.84% when pulling 215088e on betachen:master into b5aebbb on libbitcoin:master.

@evoskuil
Member

evoskuil commented Oct 11, 2016

Hi @betachen, thanks for your work on this. As an aside, I should have pointed out the style guide previously. Two things I noticed in the changes are the use of raw pointers and line length.

Also, my comments above on loop termination are incorrect (the async connect cannot return a result). Sorry for the misdirection, I was working from my phone and didn't have the bigger picture.

Use of std::promise should be avoided in this case. It produces the intended sequence of operations, but in doing so it ties up a thread. This line: promise.get_future().get(); blocks while the async_connect(...) call awaits a response. Basically it turns asynchronous execution into synchronous execution. The proactor pattern used in libbitcoin networking can be quite confusing until one gets familiar with it.
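
To illustrate the blocking behavior (a generic sketch, not the PR's code; async_connect_stub is a hypothetical stand-in for the real asynchronous call):

#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <thread>

// Stand-in for an asynchronous connect: invokes the handler from another thread.
void async_connect_stub(std::function<void(bool)> handler)
{
    std::thread([handler]()
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        handler(true);
    }).detach();
}

int main()
{
    std::promise<bool> promise;
    async_connect_stub([&promise](bool ok) { promise.set_value(ok); });

    // This blocks the calling thread until the handler fires, turning the
    // asynchronous call into a synchronous one (the problem described above).
    const auto connected = promise.get_future().get();
    std::cout << "connected: " << connected << std::endl;
}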

I don't entirely follow your latest description, but it seems from your patch that you want to execute the connection attempts sequentially rather than concurrently. Despite being sequential, the implementation above will result in the completion handler being invoked more than once, which will produce catastrophic results.

The synchronize(handler, 1, NAME, mode); line invokes the completion handler upon its first (1) invocation, successful or otherwise. This exists to manage the race between the connection timeout timer and the (single) asynchronous connection attempt. Placing this into a loop repeats the same operation for each DNS record, resulting in one call to the completion handler for each DNS record. Yet after the first call to the completion handler it is presumed to be out of scope.

I assume that the failure you were encountering results from the fact that when any DNS record fails (or times out) the completion handler is invoked with an error code. This is then in a race with any successful completion, as both have invoked the completion handler. Assuming the handler remains in scope this will produce unintended results for the caller, which is relying on the contract of exactly one invocation of the handler.

If the DNS records can be contacted in parallel, which I assume is possible, then the proper implementation would be to use two synchronizers, one for the race with the set of connection attempts against the timer and the other for the race between the set of connection attempts (against each other). The latter synchronizer would have to be configured to terminate on first success or set exhaustion, and the timer synchronizer would have to terminate on timer firing or other synchronizer completion. Then you would simply loop over the connection attempts as you did in the earlier pull request.
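
As a generic illustration of such a "first success or set exhaustion" guard (not libbitcoin's synchronize implementation), an atomic flag and counter can ensure the wrapped handler fires exactly once:

#include <atomic>
#include <functional>
#include <iostream>
#include <memory>

// Illustration only: the handler is invoked exactly once, on the first
// success or after every attempt has reported failure.
class race_guard
{
public:
    race_guard(size_t attempts, std::function<void(bool)> handler)
      : remaining_(attempts), done_(false), handler_(std::move(handler))
    {
    }

    void complete(bool success)
    {
        if (success)
        {
            // First success wins; later completions are ignored.
            if (!done_.exchange(true))
                handler_(true);
            return;
        }

        // Only report failure once every attempt has failed.
        if (--remaining_ == 0 && !done_.exchange(true))
            handler_(false);
    }

private:
    std::atomic<size_t> remaining_;
    std::atomic<bool> done_;
    std::function<void(bool)> handler_;
};

int main()
{
    auto guard = std::make_shared<race_guard>(3,
        [](bool ok) { std::cout << (ok ? "connected" : "all failed") << "\n"; });

    guard->complete(false);
    guard->complete(true);   // Handler fires here, exactly once.
    guard->complete(false);  // Ignored.
}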

However, parallel connection to a single physical host via its multiple DNS records is probably not so friendly, assuming more than one of the records is reachable. Connecting sequentially requires a different design (the completion handler must be forwarded), and raises the question of what is the proper timeout. A long list of DNS host records could significantly delay connection, so I would not reset the timeout for each one.

@evoskuil
Member

Also, it looks like reverting the libtool reinstall has resolved the Travis OSX build breaks. I assume that Travis discovered the original problem and fixed it, at which point our patch started breaking things.

@evoskuil
Member

Looking more closely at the scenario, it seems that this iteration is incorrect. According to the boost documentation for async_connect:

This function attempts to connect a socket to one of a sequence of endpoints. It does this by repeated calls to the socket's async_connect member function, once for each endpoint in the sequence, until a connection is successfully established.

The "sequence" is provided to the via iterator. Passing an incremented iterator in a loop as above will result in one call to the first host, two calls to the second, three calls to the third, and so on.

@betachen
Member Author

Hi @evoskuil, thanks for your answer and patience.
I am reading the style_guide; the libbitcoin source is pretty and tidy to me, and it's useful.

Actually, I got catastrophic results with this commit.
It seems to start up normally; however, the node is always interrupted by the next completion event or timeout event.
Secondly, as you said, it turns asynchronous execution into synchronous execution, so header sync slows down in a synchronous way, extremely slowly.

The final solution:

If the DNS records can be contacted in parallel, which I assume is possible, then the proper implementation would be to use two synchronizers, one for the race with the set of connection attempts against the timer and the other for the race between the set of connection attempts (against each other). The latter synchronizer would have to be configured to terminate on first success or set exhaustion, and the timer synchronizer would have to terminate on timer firing or other synchronizer completion. Then you would simply loop over the connection attempts as you did in the earlier pull request.

I'm afraid I have to digest it for a while first.

However, parallel connection to a single physical host via its multiple DNS records is probably not so friendly, assuming more than one of the records is reachable. Connecting sequentially requires a different design (the completion handler must be forwarded), and raises the question of what is the proper timeout. A long list of DNS host records could significantly delay connection, so I would not reset the timeout for each one.

Yes, in an earlier commit "hosts.cache" had cached 1,000 hosts; however, about 50% of hosts are unreachable in China. This slows down the procedure of searching for available hosts.
That is, if we have too large a host cache, we have to re-design the timeout, right?

@evoskuil
Member

evoskuil commented Oct 11, 2016

@betachen The current configurability should allow you to compensate for unreachable hosts. The default is based on the (non-China) experience of about 1 in 5 hosts being reachable (which targets a 20% success rate vs. your 50%). With 50% you would set connect_batch_size = 2 (2 concurrent attempts for every desired connection). This would allow you to get one connection per attempt on average.

You could set this number very high depending on circumstance. Your machine probably doesn't have 500 threads, but the asynchronous connection process spends most of its time waiting on the network, so this works just fine even with a low timeout. Extending the logical thread count via config can also help (by allowing more attempts to be started concurrently):

[network]
# The number of threads in the network threadpool, defaults to 50.
threads = 100
# The number of concurrent attempts to establish one connection, defaults to 5.
connect_batch_size = 500

Of course if connection establishment takes more time (which is probably the case for many of your peers) then increasing various timeouts can help:

[network]
# The time limit for connection establishment, defaults to 5.
connect_timeout_seconds = 5
# The time limit to complete the connection handshake, defaults to 30.
channel_handshake_seconds = 30
# The maximum time limit for obtaining seed addresses, defaults to 30.
channel_germination_seconds = 30

Interestingly, if you have low timeouts you tend to get faster peers when you do connect. With larger batches you also tend to get faster peers. In both cases the faster peers win the race.

I wouldn't mess with this one unless you have a problem getting blocks during initial block download:

[node]
# The time limit for block receipt during initial block download, defaults to 5.
block_timeout_seconds = 5

@evoskuil
Member

I am reading the style_guide; the libbitcoin source is pretty and tidy to me, and it's useful.

Thanks, we work hard on this because it is a development library and the source is meant to educate.

@evoskuil
Member

@betachen I'm closing this PR because the boost documentation makes it clear that iteration over the DNS records within the iterator is handled internal to async_connect(...). Please continue the thread in a new issue or on the IRC or mailing list if you have more questions.

@evoskuil evoskuil closed this Oct 11, 2016