Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simonhf/5589 binstat efficient stats in bins #891

Merged
merged 60 commits into from
Jul 30, 2021

Conversation

simonhf
Copy link
Contributor

@simonhf simonhf commented Jun 24, 2021

Implement WIP MVP binstat.
Implement binstat instrumentation for net, stdmap, and MakeID.
Optimize stdmap .Add() function.
Prep CBOR for network messages (but JSON still default for now).

@codecov-commenter
Copy link

codecov-commenter commented Jun 24, 2021

Codecov Report

Merging #891 (752cd4a) into master (5764ee7) will decrease coverage by 0.04%.
The diff coverage is 77.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #891      +/-   ##
==========================================
- Coverage   53.35%   53.30%   -0.05%     
==========================================
  Files         318      318              
  Lines       21506    21511       +5     
==========================================
- Hits        11474    11467       -7     
- Misses       8463     8477      +14     
+ Partials     1569     1567       -2     
Flag Coverage Δ
unittests 53.30% <77.77%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
module/mempool/stdmap/blockbycollections.go 0.00% <0.00%> (ø)
module/mempool/stdmap/queues.go 0.00% <0.00%> (ø)
network/p2p/middleware.go 0.00% <ø> (ø)
network/p2p/network.go 0.00% <ø> (ø)
network/p2p/readSubscription.go 0.00% <ø> (ø)
module/mempool/stdmap/backend.go 91.20% <75.00%> (-0.84%) ⬇️
engine/execution/ingestion/engine.go 57.31% <100.00%> (+0.17%) ⬆️
module/mempool/stdmap/incorporated_result_seals.go 80.82% <100.00%> (ø)
network/test/testUtil.go 95.72% <100.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go 40.38% <0.00%> (-9.62%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5764ee7...752cd4a. Read the comment docs.

binstat/README.md Outdated Show resolved Hide resolved
@vishalchangrani
Copy link
Contributor

@simonhf I think this tool is amazing! I also think it should be part of a separate repo (maybe under your own public repo) which we can then import into flow. That way you can have other folks use it more easily, contribute to it and spread the love ❤️

@simonhf
Copy link
Contributor Author

simonhf commented Jun 24, 2021

@simonhf I think this tool is amazing! I also think it should be part of a separate repo (maybe under your own public repo) which we can then import into flow. That way you can have other folks use it more easily, contribute to it and spread the love ❤️

@vishalchangrani thanks :-) Yeah, I put it here initially under the guidance from @turbolent (who I asked to review this as well) but am happy to move it if there is common consensus, and maybe it's something to discuss here now, or later? :-) Also, putting it here initially probably also makes it easier to do reviews, etc, while it's in its early formative stages of development?

Copy link
Contributor

@vishalchangrani vishalchangrani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@turbolent turbolent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

I mainly just looked at the Go code, I'm not familiar with the business logic

network/codec/json/decode.go Outdated Show resolved Hide resolved
binstat/binstat.go Outdated Show resolved Hide resolved
binstat/binstat.go Outdated Show resolved Hide resolved
binstat/binstat.go Outdated Show resolved Hide resolved
Comment on lines 470 to 490
// functions BEFORE go fmt v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v

/*
func New (what string, callerParams string ) *BinStat { return newGeneric(what, callerParams, false, 0 , sizeNotUsed, true ) }
func newInternal (what string ) *BinStat { return newGeneric(what, "" , false, 0 , sizeNotUsed, false) }
func NewTime (what string, callerParams string ) *BinStat { return newGeneric(what, callerParams, true , 0 , sizeNotUsed, true ) }
func newTimeInternal (what string ) *BinStat { return newGeneric(what, "" , true , 0 , sizeNotUsed, false) }
func NewTimeVal (what string, callerParams string, callerSize int64) *BinStat { return newGeneric(what, callerParams, true , callerSize, sizeAtEnter, true ) }
func newTimeValInternal(what string, callerSize int64) *BinStat { return newGeneric(what, "" , true , callerSize, sizeAtEnter, false) }

func Pnt (p *BinStat, pointUnique string ) time.Duration { return pntGeneric(p , pointUnique , 0 , sizeNotUsed, true ) }
func pntInternal (p *BinStat, pointUnique string ) time.Duration { return pntGeneric(p , pointUnique , 0 , sizeNotUsed, false) }
func End (p *BinStat ) time.Duration { return pntGeneric(p , "Leave" , 0 , sizeNotUsed, true ) }
func endInternal (p *BinStat ) time.Duration { return pntGeneric(p , "Leave" , 0 , sizeNotUsed, false) }
func EndVal (p *BinStat, callerSize int64) time.Duration { return pntGeneric(p , "Leave" , callerSize, sizeAtLeave, true ) }
func endValInternal (p *BinStat, callerSize int64) time.Duration { return pntGeneric(p , "Leave" , callerSize, sizeAtLeave, false) }

func Dbg (p *BinStat, debugText string ) { dbgGeneric(p , debugText , true ) }
*/

// functions W/&W/O go fmt ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just remove this, the common convention is to just commit the result of go fmt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! The reason why I left in the comment is for code comprehension, because there is a 'pattern' to the arguments etc which can be easily seen with a glance when looking at the white space aligned comment, but is not obvious even with a longer look at the go fmt version of the functions below the comment. I tried researching for a better alternative -- because I'm not super happy about this either -- but I couldn't find one :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turbolent another idea might be to export the comment to a separate file in the same directory, e.g. binstat.txt?

Another idea might be to put the comment into a gist and just have a comment pointing to the gist?

I dislike the last suggestion the most, the separate file suggestion the next, and the as-is comment the least. But all options are annoying... do you have any more ideas apart from just deleting the comment? :-)

binstat/binstat_external_test.go Outdated Show resolved Hide resolved
binstat/binstat.go Outdated Show resolved Hide resolved
binstat/binstat.go Outdated Show resolved Hide resolved
binstat/binstat.go Outdated Show resolved Hide resolved
@simonhf
Copy link
Contributor Author

simonhf commented Jun 25, 2021

Great work!

I mainly just looked at the Go code, I'm not familiar with the business logic

@turbolent thanks! And thanks very much for the comments which I will fix shortly, except for the couple I commented on for discussion.

If you're interested in the business logic then I included a README file with biz logic explanations, and also one of the tests shows a comparison between binstat and pprof on the same bit of example code.

Also, @vishalchangrani suggested (above) moving the package to a private repo. I don't mind either way. But it might be easier to leave it here -- as you kindly originally suggested -- for now for easier reviews, etc? We can always move it out later...?

@simonhf simonhf requested a review from turbolent June 25, 2021 02:59
@simonhf
Copy link
Contributor Author

simonhf commented Jul 17, 2021

Trying to replicate the unit test failure on local machine. Ran the engine/execution/ingestion unit tests 10x unit got a failure which was "fatal error: concurrent map writes" at this [1] line. Not sure if this is a general issue or caused by my changes? @m4ksio thinks @yhassanzadeh13 might have fixed the same error in another test. @yhassanzadeh13 thinks it might be a flaky test. So I tried restarting the tests... Second run then THAT test passed but there was a different failure:

--- FAIL: TestNotifier_AllWorkProcessed (0.00s)
    --- FAIL: TestNotifier_AllWorkProcessed/run_15 (3.04s)
        notifier_test.go:166: 
            	Error Trace:	notifier_test.go:166
            	Error:      	Condition never satisfied
            	Test:       	TestNotifier_AllWorkProcessed/run_15
            	Messages:   	only consumed 0 units of work but expecting 100
FAIL
coverage: 66.7% of statements
FAIL	github.com/onflow/flow-go/engine	45.688s

Running a 3rd time... :-) And it worked the 3rd time :-)

[1] https://github.com/onflow/flow-go/blame/simonhf/5589-binstat-efficient-stats-in-bins/engine/execution/ingestion/engine_test.go#L306

Copy link
Contributor

@yhassanzadeh13 yhassanzadeh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonhf I appreciate your time and efforts in applying the comments. Although the code looks logically sound, I still have the same concerns regarding its maintainability, encapsulation, safety, and readability. I tried to shed more light on my comments by adding more details/replies.

network/codec/json/decode.go Outdated Show resolved Hide resolved
network/codec/json/encode.go Outdated Show resolved Hide resolved
network/codec/roundTripHeader_test.go Show resolved Hide resolved
data, err := msg.Marshal()
var data []byte
var err error
bs := binstat.EnterTime("~3net:wire<4message2protobuf")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still recommend having these string arguments as constants similar to what we did here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhassanzadeh13 thanks for the comment! I have now changed the root of the string to a constant because the roots are used often. It's also easy to grep the constants and this is the recommended way to use the new binstat tool to uncomment / unhide instrumentation for a particular bin root.

utils/binstat/binstat.go Outdated Show resolved Hide resolved
@@ -53,7 +54,11 @@ func (b *Queues) Get(queueID flow.Identifier) (*queue.Queue, bool) {
}

func (b *Queues) Run(f func(backdata *QueuesBackdata) error) error {
p1 := binstat.EnterTime("~4lock:w:Backend.Run(Queues)", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonhf thanks for your reply, I think having them either on const or using string modifiers is still safer than inline strings like what we have here.

binstat.Fin()

// cat and sort binstat stats file
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest encapsulating the similar parts into a separate test function and invoke it at both places with proper parameters.

command := "ls -al ./binstat.test.pid-*.binstat.txt ; cat ./binstat.test.pid-*.binstat.txt | sort --version-sort"
out, err := exec.Command("bash", "-c", command).Output()
require.NoError(t, err)
zlog.Debug().Msgf("test: output of command: %s\n%s", command, out)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think going with the suggested pattern is more readable and maintainable over time.

@yhassanzadeh13 yhassanzadeh13 self-requested a review July 19, 2021 23:54
Copy link
Contributor

@yhassanzadeh13 yhassanzadeh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up on my review, since this PR already has gone through lots of reviews by other team members and got enough approvals for merging, I change my feedback from "Request Changes" to "Comment" (i.e., submitting general feedback without explicit approval). In this way, my comments are not a blocker for merging, and you consider them at your discretion or may move forward with the merge.

@yhassanzadeh13 yhassanzadeh13 dismissed their stale review July 20, 2021 01:09

changing to comment

@simonhf
Copy link
Contributor Author

simonhf commented Jul 24, 2021

@yhassanzadeh13 thank you so much for all your valuable comments. I feel that I have learned so much through this process. The PR is now up-to-date and all comments should have been addressed unless I forgot one which I don't think I did :-) Also, the tests are passing and the latest master is merged too! Looking forward to any more comments! Thanks!

@simonhf
Copy link
Contributor Author

simonhf commented Jul 27, 2021

@yhassanzadeh13 added some minor helper functions which ended up being useful while analyzing compression on network CBOR messages, and might be useful for similar future work.

Copy link
Contributor

@yhassanzadeh13 yhassanzadeh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 👏 , I just added a few comments about code styling, which appreciate applying them before merge.

Comment on lines +52 to +54
// DKG
CodeDKGMessage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this codec is still not in action, but it is a good practice always to append codes at the end of this enum, otherwise it may interfere with existing codes in action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhassanzadeh13 thanks for the comment!

FYI I believe CodeDKGMessage got merged yesterday into master [1] by Jordan! I merged master into this PR branch later yesterday and that caused integration issues for me. Why? Because there is a new function which looks up the human readable name for the code and it didn't exist yet for master or this branch. But I fixed that yesterday evening.

[1] https://github.com/onflow/flow-go/blob/master/network/codec/json/envelope.go#L56

options.Time = cbor.TimeRFC3339Nano // option needed for wanted time format
encMode, err := options.EncMode()
if err != nil {
panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this same code snipped replaced several places. In order to be DRY (Don't Repeat Yourself), I suggest the following workaround:

  • Moving this function as an exported function to network/codec/cbor/encode. In general, we need to bubble up the errors and let the caller decide on what to do with error handling case instead of panic upon an error:
    func EncMode() (cbor.EncMode, error) {
     options := cbor.CoreDetEncOptions() // CBOR deterministic options
     // default: "2021-07-06 21:20:00 +0000 UTC" <- unwanted
     // option : "2021-07-06 21:20:00.820603 +0000 UTC" <- wanted
     options.Time = cbor.TimeRFC3339Nano // option needed for wanted time format
     encMode, err := options.EncMode()
     if err != nil {
     	return nil, fmt.Errorf("could not extract encoding mode: %w", err)
     }
     return encMode, nil
    

}

- In our tests, we use `require` method of `testify` to terminate the test upon an error:

import "github.com/stretchr/testify/require"

encMode, err := EncMode()
require.NoError(t, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhassanzadeh13 thanks for the comment!

I refactored the code so that the block of code only exists once instead of three times. This was a useful exercise because the first two refactors failed due to Golang circular dependencies, which was my first tangle with this particular error :-)

err = msgpack.Unmarshal(b, outer)
t.Logf("- debug: outer=%+v <-- after .Unmarshal()\n", outer)
require.Nil(t, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.NoError(t, err) is a better alternative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhassanzadeh13 thanks for the comment!

Yep, I just copy and pasted the code already there for the existing tests for the other serialization formats. Now, I mercilessly refactored the whole file to use require.NoError() instead of require.Nil() :-)

return fmt.Errorf("invalid message envelope code: %d", code)
}

data = append(data, code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking don't we need any delimiter to demarcate the data from code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhassanzadeh13 thanks for the comment!

FYI in theory then no delimiter is needed. Why? Because it's just an array of bytes and the code byte is always the last byte. That means upon decoding then the last byte is always the code byte and the CBOR bytes to become unmarshalled are all the other bytes. This is an optimization for CBOR and possible because everything is just an array of bytes. Whereas, the JSON code does an expensive second marshal and second unmarshal to wrap the code in an 'envelope'.

As a side note, it turns out that appending the code is not for free either, although much faster than doing the second marshal with the envelope technique. However, I have discovered a yet faster way -- documented in Notion -- to include the code and intend to push that optimization with a future PR which actually enables CBOR.

Comment on lines +3 to +7
const (
BinNet = "~3net"
BinStdmap = "~6stdmap"
BinMakeID = "~6MakeID"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

defer b.RUnlock()
has := b.Backdata.Has(entityID)
return has
}

// Add adds the given item to the pool.
func (b *Backend) Add(entity flow.Entity) bool {
//bs0 := binstat.EnterTime(binstat.BinStdmap + ".<<lock.(Backend)Add")
entityID := entity.ID() // this expensive operation done OUTSIDE of lock :-)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@simonhf
Copy link
Contributor Author

simonhf commented Jul 30, 2021

The final message before merging:

So all the CI tests passed.

Then I merged master into this branch and all the CI tests passed again :-)

Then I ran a load test on my benchnet dev box but came up against this recent loader bug [1].

I did the workaround suggested in the issue [1] and the loader started! But I saw many new unexpected node warnings and errors (I added them to the issue [1]). And sometimes the nodes just stop after a fatal regarding epochs. I added that to the issue [1] too.

Finally out of paranoia -- too rule out that this PR has anything to do with the issue [1] and the extra warnings, errors, and fatals -- I run the loader against the latest master without this PR and it did all the same stuff.

So finally I'm hitting 'Merge pull request'... :-)

[1] https://github.com/dapperlabs/flow-go/issues/5709

@simonhf simonhf merged commit 56b6596 into master Jul 30, 2021
@simonhf simonhf deleted the simonhf/5589-binstat-efficient-stats-in-bins branch July 30, 2021 03:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants