
Network protocol suggestion: getfullblocks #522

Closed · ixje opened this issue Dec 13, 2018 · 24 comments

@ixje ixje commented Dec 13, 2018 (Contributor)

There are some improvements proposed in #366.
I'd like to discuss one more that is also backwards compatible.

The current getblocks command is inefficient and has a misleading name. The name suggests you will receive blocks; in reality you get hashes (via an inv message). You then basically repackage the same information from that inv message into a getdata command to finally receive the block messages. This is a 4-step process:

  1. -> getblocks
  2. <- inv with hash list
  3. -> getdata with the received hash list
  4. <- n times block (for all requested hashes)

We can do the same in just 2 steps.

Proposal
I'd like to propose a getfullblocks command (or something along those lines) that does the above in 2 steps (specifically only steps 1 & 4). This is nearly identical to the current getblocks, but instead of replying with an inv it immediately returns the requested blocks.

Untested example, but it should be close enough to explain the idea:

private void OnGetFullBlocksMessageReceived(GetBlocksPayload payload)
{
    // Resolve the starting point, just like OnGetBlocksMessageReceived() does
    UInt256 hash = payload.HashStart[0];
    if (hash == payload.HashStop) return;
    BlockState state = Blockchain.Singleton.Store.GetBlocks().TryGet(hash);
    if (state == null) return;

    for (uint i = 1; i <= InvPayload.MaxHashesCount; i++)
    {
        uint index = state.TrimmedBlock.Index + i;
        if (index > Blockchain.Singleton.Height)
            break;
        hash = Blockchain.Singleton.GetBlockHash(index);
        if (hash == null) break;
        if (hash == payload.HashStop) break;

        // code extracted from OnGetDataMessageReceived(): reply with the block itself
        // instead of only advertising its hash via inv
        Blockchain.Singleton.RelayCache.TryGet(hash, out IInventory inventory);
        if (inventory == null)
            inventory = Blockchain.Singleton.GetBlock(hash);
        if (inventory is Block block)
        {
            if (bloom_filter == null)
            {
                Context.Parent.Tell(Message.Create("block", inventory));
            }
            else
            {
                BitArray flags = new BitArray(block.Transactions.Select(p => bloom_filter.Test(p)).ToArray());
                Context.Parent.Tell(Message.Create("merkleblock", MerkleBlockPayload.Create(block, flags)));
            }
        }
        // end code extracted from OnGetDataMessageReceived()
    }
}

The above is basically a combination of OnGetBlocksMessageReceived() and OnGetDataMessageReceived()

Benefits

  • less network traffic
  • a self-explanatory command name instead of a misleading one
  • less business logic for retrieving blocks from the network

Possible improvements
Allow querying data using a start_index and stop_index. This would mean we cannot re-use GetBlocksPayload and would need a new payload type (a sketch follows the list below).

Benefits of these parameters are:

  • a smaller getfullblocks message, as you do not need any UInt256 hashes to start requesting data, just an integer index
  • no need to know any hashes to get started on block data syncing. Meaning: no need to know the genesis block hash to start from scratch, and no need to know the inv payload structure for deserialising a hash that was broadcast on the network. You can skip all unnecessary data and go straight to retrieving blocks.
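
For illustration, a minimal sketch of what such an index-based payload could look like, assuming Neo's ISerializable conventions; the class and field names (GetFullBlocksPayload, IndexStart, Count) are hypothetical:

using System.IO;
using Neo.IO;

// Hypothetical payload for an index-based getfullblocks request.
// The class and field names are illustrative only.
public class GetFullBlocksPayload : ISerializable
{
    public uint IndexStart;    // height of the first requested block
    public ushort Count;       // number of consecutive blocks wanted

    public int Size => sizeof(uint) + sizeof(ushort);

    public void Serialize(BinaryWriter writer)
    {
        writer.Write(IndexStart);
        writer.Write(Count);
    }

    public void Deserialize(BinaryReader reader)
    {
        IndexStart = reader.ReadUInt32();
        Count = reader.ReadUInt16();
    }
}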
@vncoelho vncoelho commented Dec 13, 2018 (Member)

I think I agree, @ixje. I already opened a PR related to some of the small points you mentioned.

On a first read I had a good feeling about what you put together and it looks correct. Let's move towards these improvements.

@vncoelho vncoelho commented Dec 13, 2018 (Member)

The part about allowing to query data with a start and finish is kind of already implemented, but not used.

@erikzhang erikzhang commented Dec 14, 2018 (Member)

This design makes sense, let me explain.

Suppose there are 4 nodes A, B, C, and D. Node A requests blocks 100~200 from the other 3 nodes. So he will send:

getblocks [100~200]

The problem here is that node A does not know if other nodes have these blocks, so it must send getblocks messages to all connected nodes.

Then node B replies:

inv [100~150]

Node B only has blocks from height 100 to 150.

Node C replies:

inv [100~200]

Node C has all the blocks.

Node D says nothing because, like node A, it is also new and does not have these blocks.

Then node A sends messages to B and C:

getdata [100~150] // to B
getdata [150~200] // to C

This way it can download blocks from multiple nodes at the same time.

If there are no inv and getdata messages, then all nodes may send blocks to node A at the same time, increasing the load on the network.
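
For illustration only (not actual neo-cli code), a rough sketch of how a requesting node could spread the advertised hashes over the peers that answered with an inv; the string-keyed peer map and the PlanGetData helper are hypothetical:

using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: given which peer advertised which block hashes (via inv),
// spread the getdata requests over those peers so no single node serves everything.
static Dictionary<string, List<string>> PlanGetData(
    IEnumerable<string> wantedHashes,
    IReadOnlyDictionary<string, HashSet<string>> advertisedByPeer)
{
    var plan = advertisedByPeer.Keys.ToDictionary(peer => peer, _ => new List<string>());
    foreach (string hash in wantedHashes)
    {
        // assign each hash to the advertising peer with the lightest load so far
        string target = advertisedByPeer
            .Where(kv => kv.Value.Contains(hash))
            .OrderBy(kv => plan[kv.Key].Count)
            .Select(kv => kv.Key)
            .FirstOrDefault();
        if (target != null)
            plan[target].Add(hash);
    }
    return plan; // each entry becomes one getdata message to that peer
}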

@ixje ixje commented Dec 14, 2018 (Contributor, Author)

Erik, I understand your example. It does however assume that all clients request data from multiple nodes simultaneously. I found that syncing from 1 good node is superior to syncing from multiple nodes that can vary in quality/response times.

So a different design can be a SyncManager that uses just 1 node for requesting all the data. Then a NodeManager can track peer health and make sure SyncManager always uses the best available known peer (because peer quality can deteriorate over time).

In my observations of the last weeks I found that really good nodes tend to reply with actual data in < 0.2s. Good nodes average around < 0.8s, and medium/poor nodes < 2s. Anything above that I considered a bad node. Ultimately, block download speed will always be faster than block processing speed once the number of transactions increases.

Regarding knowing what blocks a node has: when we connect to a node we can get this information from the VersionPayload

StartHeight = startHeight,

Of course this has its limitations at some point. However, the current logic of getblocks already allows you to just give a start_hash and it will never give you more data than it has (up to a limit of 500). We can replicate that with the getfullblocks command.
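
As an illustrative sketch of the kind of peer selection such a SyncManager/NodeManager could do (the PeerInfo record and its fields are hypothetical; only StartHeight comes from the VersionPayload):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical peer record kept by a NodeManager; StartHeight comes from the
// VersionPayload, the other figures would be measured locally.
class PeerInfo
{
    public string Address;
    public uint StartHeight;              // advertised height from VersionPayload
    public TimeSpan AverageDataLatency;   // measured request -> block-received time
    public int FailedRequests;            // timeouts / dropped requests
}

static PeerInfo SelectSyncPeer(IEnumerable<PeerInfo> peers, uint nextBlockIndex)
{
    // only consider peers that claim to have the block we need next,
    // then prefer the one with the fewest failures and best measured data latency
    return peers
        .Where(p => p.StartHeight >= nextBlockIndex)
        .OrderBy(p => p.FailedRequests)
        .ThenBy(p => p.AverageDataLatency)
        .FirstOrDefault();
}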

@jsolman jsolman commented Dec 14, 2018 (Contributor)

I also observed that there is nothing to rank node health currently and there probably should be some ranking. However, it is important that the network not be designed to create a situation where all nodes hammer the most powerful node in the network. Higher powered nodes may have reasons to be higher powered besides wanting to bear a heavier burden of serving all other nodes in the network.

@ixje ixje commented Dec 14, 2018 (Contributor, Author)

I don't think it is possible to prevent this anyway unless you're going to use some central arbiter to control it (which doesn't sound good). In the current 4-step design a client can already apply the same strategy of selecting the best nodes (according to his/her criteria). Note that "best node" is partially region dependent. A European/American peer is unlikely to ever select an Asian-hosted node as the fastest performing node, simply based on network latency alone.

@jsolman jsolman commented Dec 14, 2018 (Contributor)

I don't think it is possible to prevent this anyway unless you're going to use some central arbiter to control it (which doesn't sound good).

It is certainly possible to write code that spreads out the network load and prevents nodes from hammering the same host if they are running the Neo core. The current code certainly doesn't generally cause hammering of the host with the lowest network latency, and it doesn't use any central arbiter.

While anyone can attack a publicly facing node, the networking protocol of nodes running the official client should not exhibit such behavior.

That being said I do think the protocol should try to do some things to optimize connecting to lower latency nodes and probably keep some metric of node health.

Having requests nearly always routed to the lowest latency node to which one is currently connected might work nicely for your one client, but is likely to cause problems if all nodes were running such a client.

@ixje ixje commented Dec 14, 2018 (Contributor, Author)

Having requests nearly always routed to the lowest latency node to which one is currently connected might work nicely for your one client, but is likely to cause problems if all nodes were running such a client.

I don't really see why that would cause big problems. For example, East-Asian nodes will never get a lower network latency to nodes on other continents than to their 'domestic' ones; this holds true for nodes on all continents. Within your continent you will not at all times know the addresses of all available nodes, which is one reason why not everyone will hit the same node.

If the best node gets overloaded it will either hit its maximum of connected peers, or its performance will degrade to the point that it's no longer considered best and nodes will move on to another host. This results in spread, and performance can recover on the node that degraded. This does assume that we do not just look at network latency, but also at actual data request/response latency. To elaborate on what I mean by that: a node can have a low network latency but at the same time queue all your messages (or even drop them), giving it a poor "data request/response latency". It's the latter that matters most imo, not the former.

This is not just theoretical; the NGD nodes are a real-life example. My network latency to those is ~150ms, but their data response can be > 45 seconds.

@vncoelho vncoelho commented Dec 14, 2018 (Member)

Hi @erikzhang, reading your explanation I agree with your design and it makes sense to me.
I am in favor of this approach of decentralizing the requests as much as possible.

On the other hand, I feel that we need some minor adjustments to the TaskManager and ProtocolHandler to really handle block requests following your proposed instructions.

@jsolman jsolman commented Dec 16, 2018 (Contributor)

@ixje I agree that measuring network latency and the entire "data request/response" latency, and keeping track of these per remote node, can help optimize the network and potentially protect against bad actors. Recently, to prevent waiting too long for block data, there was a change that allows a block to be requested from up to 3 remote nodes. If we knew more about the health (average network and data latencies) of the nodes from which blocks are requested, we could de-prioritize unhealthy nodes and have better confidence in using less redundancy. That would require a number of changes to TaskManager.

Currently the first to respond with block hashes are the ones that get the task to return block data. This should already have somewhat the effect of having nodes with lower latency serve requests; however, responding to getblocks could be mostly an in-memory operation, while responding to getdata for a block far in the past could be more of a disk operation. So, a node that is asked for blocks far back in the chain may be faster than others to respond to getblocks yet slower than others to respond to getdata. Having a 'data request/response latency' metric for connected remote nodes could help factor this in, if a scheduling algorithm were built into TaskManager that took into account more information about all the connected remote nodes when making data requests.
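
As an illustration of the kind of per-peer metric this would need (a hypothetical helper, not existing TaskManager code), a simple exponential moving average of data request/response latency could look like this:

using System;
using System.Collections.Concurrent;

// Hypothetical helper: keeps a smoothed "data request/response" latency per remote
// node, which a scheduler could use to weigh down unhealthy peers.
class PeerLatencyTracker
{
    private const double Alpha = 0.2; // weight of the newest sample
    private readonly ConcurrentDictionary<string, double> _avgMs = new ConcurrentDictionary<string, double>();

    // Call when a getdata (or getfullblocks) request is answered, with the elapsed time.
    public void Record(string peer, TimeSpan elapsed)
    {
        _avgMs.AddOrUpdate(peer, elapsed.TotalMilliseconds,
            (_, old) => Alpha * elapsed.TotalMilliseconds + (1 - Alpha) * old);
    }

    // Lower is better; unknown peers get a neutral default so they still get tried.
    public double Score(string peer) =>
        _avgMs.TryGetValue(peer, out double avg) ? avg : 1000.0;
}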

@ixje ixje commented Dec 17, 2018 (Contributor, Author)

@jsolman I think we're on the same page here 👍. The only difference is that you seem to talk from the neo/neo-cli perspective and what needs to be done there, whereas I'm looking more from a language-agnostic point of view (given that my main project is neo-python).

@vncoelho vncoelho commented Dec 19, 2018 (Member)

@ixje,

with this last bug fix #524 the network reached another level of stability!
In all my tests with delays and network failures the nodes recover quite quickly.

I suggest that we do not modify anything right now.
Maybe we can rethink it in the future. Let's give it some time and analyze the current behavior.

@ixje ixje commented Dec 19, 2018 (Contributor, Author)

@vncoelho I do not consider the issue you mentioned directly related to this one. That one was a client-specific issue (and it is good that it is solved), but this getfullblocks suggestion is client-agnostic.

@vncoelho vncoelho commented Dec 19, 2018 (Member)

@ixje I agree that it is different.
You can see my comment at the beginning of this thread where I expressed my support for these novel strategies.

What I mean is that the network is behaving smoothly and, in this sense, novel strategies should be analyzed with care.

@jsolman jsolman commented Jan 12, 2019 (Contributor)

@ixje I thought I'd point out that the full block can be retrieved today from RPC nodes by passing the block number to the 'getblock' method.
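
For reference, a minimal sketch of that RPC call; the endpoint URL and block index are placeholders, and the second parameter selects the verbose JSON form of getblock:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class RpcExample
{
    // Fetch block 123456 from an RPC-enabled node; "http://localhost:10332" is a placeholder endpoint.
    static async Task<string> GetBlockByIndexAsync(HttpClient client)
    {
        const string request = "{\"jsonrpc\":\"2.0\",\"method\":\"getblock\",\"params\":[123456,1],\"id\":1}";
        var content = new StringContent(request, Encoding.UTF8, "application/json");
        HttpResponseMessage response = await client.PostAsync("http://localhost:10332", content);
        return await response.Content.ReadAsStringAsync(); // JSON containing the full block
    }
}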

This isn't part of the normal network protocol because it isn't really that useful there: nodes need to know who has a block before getting it, and nodes use a push mechanism of letting each other know when they have blocks rather than a pull mechanism of asking other nodes if they have block 'x'. Also, since nodes usually get headers first, they already know all the block hashes when syncing, so they can call 'getdata' directly with the block hash if they want the block data in 1 call. Therefore the proposed additional command would not really help anything for the normal network protocol, and for RPC the method already exists.

To be clear: once synced, in normal operation, nodes broadcast an inv as soon as they persist a block. This lets the receiving nodes know they can immediately ask for the data of the new block. So there is no 4-step process in normal operation -- it is just 2 steps.

  1. -> node broadcasts inv message once they have a new block
  2. <- receiver asks for the new block data using a getdata message

That being said, what I discussed earlier about keeping track of the health of nodes (latency, availability, etc.) should probably have a separate issue opened for it.

@jsolman jsolman commented Jan 12, 2019 (Contributor)

I think we can close this; I'd prefer to take up what could be most useful from this discussion in #542, which I have opened.

@jsolman jsolman closed this Jan 12, 2019
@ixje ixje commented Jan 12, 2019 (Contributor, Author)

@ixje I thought I'd point out that the full block can be retrieved today from RPC nodes by passing the block number to the 'getblock' method.

This isn't part of the normal network protocol since it isn't really that useful since nodes need to know who has a block first before getting it

I have to say I disagree with your opinion that it is not useful as part of the normal network protocol. First, as pointed out before, you know what blocks a node has from the StartHeight in the VersionPayload when connecting. That's how the old neo-project code even decided whether or not to start requesting data (I don't know what it is like now since the move to Akka).

Second, getheaders does exactly what I proposed but it only returns headers. You give a start hash and receive up to 2000 headers. Rinse and repeat.

If I wanted to sync from zero with the proposed solution I would (a rough sketch follows the list):

  • send getfullblocks with startHash set to the genesis block hash and no stopHash (or with an index if the alternative approach is accepted)
  • receive up to max count blocks
  • process/persist
  • take the hash or index from the last block, rinse and repeat previous steps until in sync.
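
A rough sketch of that client-side loop; RequestFullBlocks, Persist, remotePeer and maxCount are hypothetical placeholders, not an existing API:

// Hypothetical client-side sync loop for the proposed getfullblocks command.
uint nextIndex = 0;                       // start from the genesis block
while (nextIndex <= remotePeer.StartHeight)
{
    // ask for up to maxCount full blocks starting at nextIndex (index-based variant)
    IReadOnlyList<Block> blocks = RequestFullBlocks(remotePeer, nextIndex, maxCount);
    if (blocks.Count == 0) break;         // the peer has nothing newer for us

    foreach (Block block in blocks)
        Persist(block);                   // process/persist

    nextIndex = blocks[blocks.Count - 1].Index + 1;   // rinse and repeat
}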

The proposal is not necessarily asking for replacement, but for opening up to alternative designs from the C# project.


Also, since nodes usually get headers first they already also know all the block hashes when syncing, so they can call 'getdata' directly with the block hash if they want to get block data in 1 call.

This is an assumption probably based on how neo/neo-cli works. Neotracker doesn't use it, my ongoing rewrite for neo-python doesn't use it, and it wouldn't surprise me if other clients don't either.

Given the above I disagree with the closing of this issue. Your new issue only keeps improvements to the C# project in mind.

Having said that, I actually think we need a neutral discussion place where we can propose NEO ideas/solutions. This issue tracker is bound to the C# project and that seems to automatically push people's thinking to stay within the boundaries of the C# project instead of the vision of the NEO project (I've seen this multiple times now). Let me know if I have somehow missed that the vision has become that the whole world will just have to interface with some C# clients. In that case I'll simply withdraw.

@jsolman jsolman reopened this Jan 13, 2019
@jsolman jsolman commented Jan 13, 2019 (Contributor)

Why can’t other projects also sync headers?

@jsolman jsolman commented Jan 13, 2019 (Contributor)

@ixje I understand your frustration and don’t want you to feel like I’m not being neutral about the issue. Let’s leave the issue open and get more people’s opinions.

I do think getting headers first is probably a better design in general. For instance, it makes it less likely to get the wrong block at a height.

Anyway, it would be easy to add support for what you are asking.

@erikzhang what do you think about supporting a message in the core protocol that will return the block data given the block height for alternative implementations to use?

@ixje ixje commented Jan 13, 2019 (Contributor, Author)

Why can’t other projects also sync headers?

It's not that they can't, it's that they don't always have a need for it. Not every node is interested in also sharing headers/blocks on the network. The most obvious example being a tracker like neo-scan/neotracker. A statistics website might just want to collect data, not share it. Enabling sharing of data also increases the load on the host, which might not be wanted.

Is it the best for the network? Arguably not. But it is possible and there are legitimate reasons for it, so people will do it.


I do think getting headers first is probably a better design in general. For instance, it makes it less likely to get the wrong block at a height.

Can you elaborate on why it would be less likely to get the wrong block? The first thing that comes to mind is that recently many testnet clients were stuck on a faulty block 2116137 even though they used the headers-first, blocks-later sequence. I'm not really seeing what a single header can provide that a block (which contains all the information) can't.

@jsolman jsolman commented Jan 13, 2019 (Contributor)

The most obvious example being a tracker like neo-scan/neotracker.

It can use the RPC call getblock to get the block by block number.

@jsolman jsolman commented Jan 13, 2019 (Contributor)

Can you elaborate why it would be less likely to get the wrong block? The first thing that comes to mind is that recently many testnet clients were stuck on a faulty block 2116137 even though they used the headers first,

I have a change that never gets stuck when syncing headers first. The only reason they got stuck is that they accepted headers when the last header was the stuck block, due to the issue where two blocks can get created at the same height. If you sync headers and never save the last header received, you will never get stuck. I run code that does this on my nodes and they don't get stuck now.
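
A minimal sketch of one way to read that trick (hypothetical helper names, not the actual change): persist every header in a received batch except the last one, so the tip header is always requested again and an orphaned header at the tip cannot wedge the sync.

// Hypothetical illustration of "never save the last header received".
void OnHeadersReceived(IReadOnlyList<Header> headers)
{
    if (headers.Count < 2) return;            // nothing safe to persist yet
    for (int i = 0; i < headers.Count - 1; i++)
        SaveHeader(headers[i]);               // SaveHeader is a placeholder persistence call
    // the last header is intentionally dropped; it will be requested again next round
}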

@jsolman jsolman commented Jan 13, 2019 (Contributor)

I should note though that the issue of two blocks created at the same height will be fixed soon anyway, when the commit-phase consensus change goes in.

@ixje ixje commented Jan 13, 2019 (Contributor, Author)

The most obvious example being a tracker like neo-scan/neotracker.

It can use the RPC call getblock to get the block by block number.

That's a (valid) workaround, but is that really what you want? Now you have to host a C# node just so you can call some RPC method in a reliable way.

If you sync headers and never save the last header received you will never get stuck.

I lost you there. Are you perhaps on NEO discord and willing to discuss it shortly? I have the same user name there.

@shargon shargon mentioned this issue May 13, 2019
@ixje ixje closed this Jun 14, 2019
@ixje ixje reopened this Oct 2, 2019
@Tommo-L Tommo-L mentioned this issue Nov 4, 2019
vncoelho added a commit that referenced this issue Nov 7, 2019
* add GetFullBlocks P2P logic

* add missing new line

* allow request genesis block

* Optimization

* Update MessageCommand.cs

* - rename command
- fix protocol handler cast
- fix payload deserialization for default value

* change to ushort per review

* remove default count, rename class

* typo + failed refactor coverage ¯\_(ツ)_/¯
shargon added a commit that referenced this issue Nov 14, 2019
* Add GetFullBlocks P2P logic (enable fixing #522) (#1138)
Tommo-L added a commit to Tommo-L/neo that referenced this issue Jan 10, 2020