Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network encoding default now CBOR with envelope code optimization #1132

Merged
merged 5 commits into from
Aug 16, 2021

Conversation

simonhf
Copy link
Contributor

@simonhf simonhf commented Aug 12, 2021

Use CBOR with optimized envelope code for network message encoding; > 4x faster on localnet at 100 TPS.

@simonhf
Copy link
Contributor Author

simonhf commented Aug 12, 2021

Please note that this PR has also been tested here [1] with localnet at 100 TPS for 600 seconds.

Also please note that the pre-this-PR plumbing for CBOR was previously merged in this [2] previous PR, and has lay dormant until this PR :-)

[1] https://www.notion.so/dapperlabs/Change-network-message-encoding-to-CBOR-PR-52d29225c82e4369ba03ff326cf06b40
[2] #891

@yhassanzadeh13 yhassanzadeh13 self-assigned this Aug 12, 2021
@@ -182,26 +178,18 @@ func switchv2what(v interface{}) (string, error) {
return what, nil
}

func v2envEncode(v interface{}, via string) ([]byte, uint8, error) {
func v2envelopeCode(v interface{}) (string, uint8, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind add some comments for this function. Am not 100% sure what is it doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishalchangrani thanks for the comment. As requested I added extra comments. Please let me know if it's clearer now :-)

err = encoder.Encode(v)
//binstat.LeaveVal(bs2, int64(data.Len()))
if err != nil {
return nil, fmt.Errorf("could not encode cbor payload of type %s: %w", what, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the what here dump the whole payload? should we instead dump just code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishalchangrani thanks for the comment. what is actually just the human readable text name for code. However, I updated the error message to contain both so it is clearer. So the payload is never dumped. Is the new error message clearer now?

err := d.dec.Decode(data)
//binstat.LeaveVal(bs, int64(len(data)))
//binstat.LeaveVal(bs1, int64(len(data)))
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err != nil {
if err != nil || len(data) == 0 {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishalchangrani thanks for the comment. I added the extra len(data) == 0. Thanks for spotting this :-)

@@ -130,7 +130,7 @@ func (fnb *FlowNodeBuilder) BaseFlags() {
func (fnb *FlowNodeBuilder) EnqueueNetworkInit(ctx context.Context) {
fnb.Component("network", func(builder NodeBuilder, node *NodeConfig) (module.ReadyDoneAware, error) {

codec := jsoncodec.NewCodec()
codec := cborcodec.NewCodec()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishalchangrani thanks for the comment!

Please see my response to @huitseeker below regarding testing. I'm wondering how I managed to run a 100 TPS load on localnet successfully for 600 seconds without the above change to access_node_builder.go? localnet does have an access node, so how could it have worked if it has a different codec?

Or is the link you kindly provided to the access node JSON configuration for something else, e.g. just for integration tests, or just for the access node communicating with the outside world, or something else not to do with network messages with the other nodes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah! I can see where this confusion can arise from. So, the EnqueueNetworkInit here is not your regular Network init but the init of the unstaked network optionally used by the the access node if it supports an unstaked network. Currently, by default, the access node will not bring up this network.

However, things have changed since I made the comment 😄 and we will actually be removing/modifying this significantly since we decided we will not have two separate networks. So, you can either just make the change for consistency or leave it as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishalchangrani Oh, thank you very much for explaining how the confusion occurred.

I think I'd prefer to just leave it as-is because... I don't currently have a good way to test it unless there is a localnet option to bring up and use the unstaked network somehow? Or a different way to test it? And it sounds like that code is both not being currently used, and there will be no separate network in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #1147 to record this for follow-up.

Copy link
Contributor

@vishalchangrani vishalchangrani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm..left u some comments.
thanks for doing this.

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues on the code, but can you tell me more about the testing / deployment plan for this feature?

Things I'd love to see (include, but are not limited to):

  • property testing of v2envelopeCode, envelopeCode2v, or similar exhaustive generation of the message types we expect,
  • bench/testnet-only options for dumping any problematic payload to a file to replay in integration tests later,
  • monitoring of deserialization errors,

code := data[len(data)-1] // only last byte
//binstat.LeaveVal(bs, int64(len(data)))
//bs1 := binstat.EnterTime(binstat.BinNet + ":wire>3(cbor)payload2envelope")
code := data[0] // only first byte
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an in-code documentation of this per-first-byte type discrimination? I expect this will eventually be part of public specs, but code comments would be a start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huitseeker thanks for the comment! FYI I added some more in-code comments for @vishalchangrani and they also mention code. Please let me know if these are sufficient, or if you have any more suggestions and/or wishes.

@codecov-commenter
Copy link

Codecov Report

Merging #1132 (7b3e47f) into master (c2dd9e0) will decrease coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1132      +/-   ##
==========================================
- Coverage   53.11%   53.08%   -0.04%     
==========================================
  Files         322      322              
  Lines       21830    21830              
==========================================
- Hits        11596    11588       -8     
- Misses       8659     8670      +11     
+ Partials     1575     1572       -3     
Flag Coverage Δ
unittests 53.08% <ø> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...sus/approvals/assignment_collector_statemachine.go 42.30% <0.00%> (-7.70%) ⬇️
cmd/util/ledger/migrations/storage_v4.go 41.56% <0.00%> (-0.61%) ⬇️
module/mempool/epochs/transactions.go 100.00% <0.00%> (+5.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2dd9e0...7b3e47f. Read the comment docs.

@simonhf
Copy link
Contributor Author

simonhf commented Aug 13, 2021

@huitseeker thanks for the comment!

No issues on the code, but can you tell me more about the testing / deployment plan for this feature?

There are 4 levels at which the feature was tested:

  • When first creating the CBOR codec I relied on the integration tests which send messages around (albeit not a great selection of messages), but these take a long time to run and are therefore inconvenient for iterating quickly during development.
  • So I added to this package, a local round-trip test [1] which sped up development a lot.
  • I've tested extensively using a 10 node localnet with 100 TPS load for 600 seconds, while monitoring for non-info, non-debug level logs. For example, when epoch was added this broke the branch immediately because a new message type had been added, and was missing from the CBOR branch.
  • In addition to the localnet load testing, the slow bits of the network message payload transformation, for both JSON and CBOR are wrapped in binstat [2] instrumentation. This allowed me to relatively accurately measure the frequency, timing, size, and type of messages for the localnet load test for different scenarios, e.g. using JSON, using CBOR, and using CBOR with different compression options, etc. I could see that the number of messages sent and received for each message type remaining similar whether using JSON or CBOR encoding, and all without log errors; please see [3].

It could be that there are other message types not exercised by the localnet load test, and I'm hoping that if this is the case then these would be exercised at the latest in the following step:

Lastly, in addition -- and long before this code would get to mainnet -- I'm assuming this code would be tested further as part of pre-mainnet tests (e.g. on canary and benchnet), and that if there is any remaining issue then it would be found at the latest then.

Other assumptions that I have made about testing CBOR:

  • CBOR has been put into deterministic encoding mode (as you requested), but I have assumed this just works and have not created any tests to test CBOR itself, assuming that it has tests itself to do that.
  • I'm also assuming that CBOR is really good at encoding arbitrary Golang data structures, because it is used in so many other projects.
  • The main tricky part of getting CBOR to work was getting its handling of serialization of time and date working. If the timestamp format is incorrect then lots of errors occur because nodes will try to use an incorrect timestamp in crypto operations which will cause errors. But once this is working this in theory it will just work for any flow message?

Things I'd love to see (include, but are not limited to):

  • property testing of v2envelopeCode, envelopeCode2v, or similar exhaustive generation of the message types we expect,
  • bench/testnet-only options for dumping any problematic payload to a file to replay in integration tests later,
  • monitoring of deserialization errors,

Background: I just copy and pasted most of this code from the old JSON codec (still in the code) and it unfortunately does not contain the things loved to see :-(

Do you think the things loved to be seen should be in the scope of this PR, or could they be captured as new issues for later enhancement to testing and maintenance?

[1] https://github.com/onflow/flow-go/blob/simonhf/5610-optimize-network-message-serialization/network/codec/roundTripHeader_test.go
[2] https://github.com/onflow/flow-go/tree/master/utils/binstat
[3] https://www.notion.so/dapperlabs/Change-network-message-encoding-to-CBOR-PR-52d29225c82e4369ba03ff326cf06b40

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch for those comments on current and future testing approaches. You're right: while we have good runtime experience at testing those changes, there's currently not a lot in the direction of feeding inputs to the serialization statically. It would be fantastic if you could issue up a couple of those items to tackle later.

@@ -130,7 +130,7 @@ func (fnb *FlowNodeBuilder) BaseFlags() {
func (fnb *FlowNodeBuilder) EnqueueNetworkInit(ctx context.Context) {
fnb.Component("network", func(builder NodeBuilder, node *NodeConfig) (module.ReadyDoneAware, error) {

codec := jsoncodec.NewCodec()
codec := cborcodec.NewCodec()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #1147 to record this for follow-up.

@simonhf simonhf merged commit 8df9a3b into master Aug 16, 2021
@simonhf simonhf deleted the simonhf/5610-optimize-network-message-serialization branch August 16, 2021 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants