Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peers: Track errors across connections #4051

Merged
merged 3 commits into from Mar 17, 2020

Conversation

@carlaKC
Copy link
Collaborator

carlaKC commented Mar 5, 2020

This PR adds tracking of peer errors across peer connections.

  • The latest 10 errors are cached in a circular array in the peer struct (10 is a magic number, seemed reasonable).
  • On disconnection, the peer hands off this cache to the server
  • On reconnection, the server creates the peer struct with their historic error cache
  • The latest error for each peer is displayed in ListPeers; showing the whole list felt like overkill (?) but could do so (since we'd need to add an extra endpoint to expose them otherwise)

Fixes #3000

@carlaKC carlaKC requested review from cfromknecht and Roasbeef as code owners Mar 5, 2020
@carlaKC carlaKC requested review from bhandras and guggero and removed request for Roasbeef and cfromknecht Mar 5, 2020
Copy link
Collaborator

bhandras left a comment

Nice and clean, mostly small comments from my side.

// Circular is a cache which retains a set of values in memory, and
// overwrites the oldest item in the cache when a new item needs to be
// written.
type Circular struct {

This comment has been minimized.

Copy link
@bhandras

bhandras Mar 5, 2020

Collaborator

nit note: this data structure is usually called "ring/circular buffer". A cache is more of a read/write trough data structure and speeds up reading/writing with random access (get/set). Maybe a better following this reasoning is the queue package.

This comment has been minimized.

Copy link
@carlaKC

carlaKC Mar 9, 2020

Author Collaborator

Renamed to buffer :)

Maybe a better following this reasoning is the queue package.

Do you mean move this into the queue package?

cache/cache.go Outdated Show resolved Hide resolved
cache/cache.go Outdated Show resolved Hide resolved
cache/cache.go Outdated Show resolved Hide resolved
peer.go Outdated Show resolved Hide resolved
lnrpc/rpc.proto Show resolved Hide resolved
lnrpc/rpc.swagger.json Outdated Show resolved Hide resolved
lnrpc/rpc.swagger.json Show resolved Hide resolved
lnrpc/rpc.swagger.json Outdated Show resolved Hide resolved
// Create a slice with length and capacity equal to the size of
// the cache so that we do not need to resize the underlying
// array when we add items.
items: make([]interface{}, size),

This comment has been minimized.

Copy link
@bhandras

bhandras Mar 5, 2020

Collaborator

Instead of taking up a fixed size, this could grow as well. Not a big thing as for now we only use this with a handful of peers.

This comment has been minimized.

Copy link
@cfromknecht

cfromknecht Mar 6, 2020

Collaborator

another possibility is to allocate this as make([]interface{}, 0, size) so that it is preallocated, but len still works as expected.

This comment has been minimized.

Copy link
@carlaKC

carlaKC Mar 9, 2020

Author Collaborator

make([]interface{}, 0, size)

Thought of going this route but then you end up having custom logic for the first set of elements you add - need to append for the first size x elements (so that slice len is increased) and then store at index after that. Since we don't really need len for much, I thought it would be better to have more simple add logic?

@roeierez

This comment has been minimized.

Copy link
Contributor

roeierez commented Mar 5, 2020

  • The latest error for each peer is displayed in ListPeers; showing the whole list felt like overkill (?) but could do so (since we'd need to add an extra endpoint to expose them otherwise)

Suggestion: A boolean parameter can be added to the RPC include-all-errors that will add the errors history only if true.
It will eliminate the need to add another RPC.

Copy link
Collaborator

guggero left a comment

Nice PR, this will certainly be very useful!
Not much to add from my side other than sending the whole list of errors instead of just the last one.

cache/cache_test.go Outdated Show resolved Hide resolved
lnrpc/rpc.proto Outdated Show resolved Hide resolved
lnrpc/rpc.proto Show resolved Hide resolved
@Roasbeef Roasbeef added this to the 0.10.0 milestone Mar 5, 2020
@Roasbeef Roasbeef added this to In Progress in v0.10.0-beta via automation Mar 5, 2020
Copy link
Collaborator

cfromknecht left a comment

Well done 🎉Only question I have is whether we should expand the set of messages that we will start caching. The original issue only says to cache lnwire.Error messages, but i see that we also store a few internal errors, e.g. if they send an incorrect message type. There are a few other places where error caching would be useful imo, particularly

  • When exchanging/validating Init messages
  • Anything that goes through Disconnect

Before adding those tho I think we need to decide whether this is meant to be for general debugging, or whether it should be reserved for in-protocol errors. imo i think the debugging tool is more useful so we can help node operators surface actionable errors.

// Create a slice with length and capacity equal to the size of
// the cache so that we do not need to resize the underlying
// array when we add items.
items: make([]interface{}, size),

This comment has been minimized.

Copy link
@cfromknecht

cfromknecht Mar 6, 2020

Collaborator

another possibility is to allocate this as make([]interface{}, 0, size) so that it is preallocated, but len still works as expected.

cache/cache.go Outdated Show resolved Hide resolved
cache/cache.go Outdated Show resolved Hide resolved
server.go Outdated
@@ -3085,6 +3110,10 @@ func (s *server) removePeer(p *peer) {
delete(s.outboundPeers, pubStr)
}

// Copy the peer's error cache across to the server so that we can
// restore it if the peer reconnects.
s.peerErrors[pubStr] = p.errorCache

This comment has been minimized.

Copy link
@cfromknecht

cfromknecht Mar 6, 2020

Collaborator

we could also store this upon initialization where we attempt the lookup

@@ -2791,6 +2812,10 @@ func (s *server) peerConnected(conn net.Conn, connReq *connmgr.ConnReq,

s.addPeer(p)

// Once we have successfully added the peer to the server, we can
// delete the previous error cache from the server.
delete(s.peerErrors, pkStr)

This comment has been minimized.

Copy link
@cfromknecht

cfromknecht Mar 6, 2020

Collaborator

is this what we want to do? won't this wipe the messages if the peer disconnects?

This comment has been minimized.

Copy link
@carlaKC

carlaKC Mar 9, 2020

Author Collaborator

Would it? Here, once the peer has connected, we copy the error buffer across to the peer struct and delete it from the server (no need to maintain 2 copies). If the peer disconnects, we copy it back to the server in removePeer.


message TimestampError{
// The unix timestamp in seconds when the error occurred.
uint64 timestamp = 1;

This comment has been minimized.

Copy link
@cfromknecht

cfromknecht Mar 6, 2020

Collaborator

In other parts of the code base we include a _ns suffix for clarity, i'm not sure if we use _s anywhere though. Maybe the de facto rule is that no suffix means seconds, but idk how well we adhere to that.

@carlaKC carlaKC force-pushed the carlaKC:3000-peererrors branch 3 times, most recently from 33a399f to 879453f Mar 9, 2020
@carlaKC

This comment has been minimized.

Copy link
Collaborator Author

carlaKC commented Mar 9, 2020

A boolean parameter can be added to the RPC include-all-errors that will add the errors history only if true.

@roeierez nice idea! Added this :)

Before adding those tho I think we need to decide whether this is meant to be for general debugging, or whether it should be reserved for in-protocol errors. imo i think the debugging tool is more useful so we can help node operators surface actionable errors.

I think debugging is also the primary use case here. Rather have more errors tracked and then potentially decrease what we track/ make cache size configurable etc.

For init, I'd say track invalid features and DLP required, pretty straightforward.

Disconnect is an interesting one, because I almost think you'd want to have the last disconnection reason around permanently, not have it in a buffer which may get over written after some number of errors. Maybe overkill, but it does seem to be a big debugging issue that people bring up so could be worth adding a dedicated last disconnection error field?

@Roasbeef Roasbeef moved this from In Progress to Review In Progress in v0.10.0-beta Mar 10, 2020
@carlaKC carlaKC force-pushed the carlaKC:3000-peererrors branch 2 times, most recently from b33edb2 to 05cc997 Mar 11, 2020
peer.go Show resolved Hide resolved
@carlaKC carlaKC requested review from bhandras and cfromknecht Mar 12, 2020
@carlaKC carlaKC force-pushed the carlaKC:3000-peererrors branch from 05cc997 to 9ab99ae Mar 12, 2020
@carlaKC carlaKC requested a review from guggero Mar 12, 2020
Copy link
Collaborator

guggero left a comment

LGTM, pending the guarding of the cast 🎉

queue/circular_buf.go Outdated Show resolved Hide resolved
queue/circular_buf.go Outdated Show resolved Hide resolved
queue/circular_buf.go Outdated Show resolved Hide resolved
queue/circular_buf.go Show resolved Hide resolved

// Add the relevant peer errors to our response.
for _, error := range peerErrors {
tsError := error.(*timestampedError)

This comment has been minimized.

Copy link
@guggero

guggero Mar 12, 2020

Collaborator

To avoid a panic if there ever is something other in there, we could just skip the item if the cast is not ok.

This comment has been minimized.

Copy link
@carlaKC

carlaKC Mar 13, 2020

Author Collaborator

If we do make a change which adds something else, I'm pretty confident that the itests will panic. Should be picked up developer end when testing changes anyway, so I'd rather fail loudly?

This comment has been minimized.

Copy link
@guggero

guggero Mar 17, 2020

Collaborator

Yes, failing loudly also sounds good to me. It should never happen but a nice error is always nicer than just a panic.

Copy link
Collaborator

bhandras left a comment

LGTM ⚡️

}

// Total returns the total number of items that have been added to the buffer.
func (c *CircularBuffer) Total() int {

This comment has been minimized.

Copy link
@bhandras

bhandras Mar 12, 2020

Collaborator

Wouldn't it be more useful if this method would return the number of items currently in the buffer instead of total items added?

This comment has been minimized.

Copy link
@carlaKC

carlaKC Mar 13, 2020

Author Collaborator

Kinda useless after you hit the total size, since it'll just always return the size. If we end up needing that, I'll add a Size() func so you can figure out how full it is.

queue/circular_buf.go Show resolved Hide resolved
v0.10.0-beta automation moved this from Review In Progress to Merge Queue Mar 12, 2020
@carlaKC carlaKC force-pushed the carlaKC:3000-peererrors branch from 9ab99ae to c01c35b Mar 13, 2020
@carlaKC

This comment has been minimized.

Copy link
Collaborator Author

carlaKC commented Mar 13, 2020

Rebased and updated some nits, going to wait on the change from @Crypt-iQ before merge, just so I can do one last set of tests.

@Roasbeef

This comment has been minimized.

Copy link
Member

Roasbeef commented Mar 17, 2020

Needs rebase. @Crypt-iQ's PR referenced hasn't yet entered the review cycle yet, so I think this can land, with that PR being updating the referenced area as needed.

carlaKC added 3 commits Mar 17, 2020
This commit introduces a fixed size circular buffer  which stores
elements in a fixed size underlying array, wrapping to overwrite items
when the buffer gets full.
Add an error buffer to the peer struct which will store errors for
peers that we have active channels with. We do not store these errors
with peers that we do not have channels open with to prevent peers from
connecting and costlessly spamming us with error messages. When the peer
disconnects, the error buffer is offloaded to the server so that we can
track errors across connections. When peers reconnect, they are created
with their historic error buffer.
This change adds a set of errors to the peer struct returned by list
peers. A latest error boolean is added to allow for more succinct
default lncli responses.
@carlaKC carlaKC force-pushed the carlaKC:3000-peererrors branch from c01c35b to 4c48d03 Mar 17, 2020
@carlaKC carlaKC merged commit b03c3c2 into lightningnetwork:master Mar 17, 2020
1 of 2 checks passed
1 of 2 checks passed
coverage/coveralls Coverage decreased (-0.04%) to 62.988%
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
v0.10.0-beta automation moved this from Merge Queue to Done Mar 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
v0.10.0-beta
  
Done
Linked issues

Successfully merging this pull request may close these issues.

7 participants
You can’t perform that action at this time.