
feat: add basic tracing for responses #291

Merged: 4 commits into main from rvagg/tracing, Dec 11, 2021
Conversation

@rvagg (Member) commented Dec 2, 2021

This works so far for the responseManager.ProcessRequests() path, but it's pretty complicated. The two main problems (building on what I mentioned at https://github.com/ipfs/go-graphsync/pull/290/files#r760950285 about the branching inside GraphSync#ReceiveMessage() and wanting a base span that lives for the entirety of an incoming message) are:

  • Even if taking the responseManager path in isolation, each "message" can have multiple requests, each of which becomes detached from the others for processing, so making a "message" span live across all of these is going to take some tricky tracking, or a special wrapper that decrements on each call to "End" and calls the true "End" when the count reaches zero (see the sketch below).
  • The fact that an incoming message can have two paths, one to responseManager and one to requestManager (the latter having a fairly complicated journey), doubles the pain of giving a parent "message" span a discrete end.

Perhaps a parent "message" span isn't what we want, and we should just have a parent "request" span for each of the requests that an individual message can trigger? Which is a bit unfortunate because then we can't have a span that lives for the lifetime of a call through the top-level GraphSync call, but maybe that's OK?

Currently I have a parent "message" span live until all child "request" spans have been started, so there'll be a parent->child relationship, but the parent won't live much longer after the child is started. I bet different trace viewing tools will deal with that in their own unique way but I reckon it'll still retain the hierarchy.
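
To illustrate the wrapper idea, something like this might work (a hypothetical sketch only; refSpan, newRefSpan and Done aren't in this PR):

```go
import (
	"sync/atomic"

	"go.opentelemetry.io/otel/trace"
)

// refSpan wraps a trace.Span so that several independently-processed requests
// can each signal completion; the wrapped span is only ended once all of them
// have. Hypothetical sketch, not part of this PR.
type refSpan struct {
	span  trace.Span
	count int64
}

func newRefSpan(span trace.Span, refs int64) *refSpan {
	return &refSpan{span: span, count: refs}
}

// Done decrements the reference count and ends the wrapped span at zero.
func (r *refSpan) Done() {
	if atomic.AddInt64(&r.count, -1) == 0 {
		r.span.End()
	}
}
```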

@rvagg (Member, Author) commented Dec 2, 2021

The test failures are all from the original tracing tests not expecting these new traces, but they're instructive to look at to see what's been added, e.g.:

TestGraphsyncRoundTrip:

    "message(0)->request(0)->executeTask(0)",
    "message(1)",
    "request(0)->newRequest(0)",
    "request(0)->terminateRequest(0)",
    "request(0)->executeTask(0)"

TestGraphsyncRoundTripRequestBudgetRequestor:

    "message(0)->request(0)->abortRequest(0)",
    "request(0)->newRequest(0)",
    "request(0)->terminateRequest(0)",
    "request(0)->executeTask(0)",
    "message(1)"

TestNetworkDisconnect:

    "message(0)->request(0)->executeTask(0)",
    "message(0)->request(0)->abortRequest(0)",
    "message(0)->request(0)->executeTask(1)",
    "request(0)->newRequest(0)",
    "request(0)->terminateRequest(0)",
    "request(0)->executeTask(0)"

And TestGraphsyncRoundTripIgnoreNBlocks has lots of message spans that go into the (so far) untraced requestManager.ProcessResponses() call.

@hannahhoward (Collaborator) commented Dec 3, 2021

@rvagg I think we had a bit of confusion here. Yes, we should not make the message span the parent span. That would create a LOT of complexity and that's not the intention.

In fact, we can ignore "message spans" entirely. BTW, the context we get on ReceiveMessage would have more value if it had existing spans associated with it, but it has nothing: https://github.com/ipfs/go-graphsync/blob/main/network/libp2p_impl.go#L149-L151

We should track only individual spans for requests, at least for now, and they should simply start off ResponseManager.ctx for now, inside of ResponseManager.processRequests, and only when it's a new request.

We do need a better approach eventually to link this up to data transfer spans, but that's a different ticket. I'll submit an issue shortly with how we can deal with this aspect for the time being.

@hannahhoward (Collaborator) commented Dec 3, 2021

Essentially, this should start with no message span at all:

ctx, requestSpan := otel.Tracer("graphsync").Start(trace.ContextWithSpan(rm.ctx, messageSpan), "request", trace.WithAttributes(

and it should move here:
rm.connManager.Protect(p, request.ID().Tag())
so we're ignoring cancels and updates, which shouldn't start a new span.

BTW, if we do want to do message spans, we should simply use "Links" to connect them to request spans, rather than creating the request spans off of them. I'm not sure whether this is something we should tackle now, and I really haven't messed with OTel Links just yet. But according to the docs, this is exactly what Links are supposed to be used for: https://pkg.go.dev/go.opentelemetry.io/otel/trace#Link
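
Roughly like this (a sketch only; the helper name and its arguments are made up for illustration, but otel.Tracer, trace.WithAttributes and trace.WithLinks are the real OTel APIs):

```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// startRequestSpan is a hypothetical helper: the "request" span starts off the
// ResponseManager's long-lived context rather than a parent "message" span,
// and the message span (if we keep one) is attached as a Link instead.
func startRequestSpan(rmCtx context.Context, messageSpan trace.Span, id int, root string) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(rmCtx, "request",
		trace.WithAttributes(
			attribute.Int("id", id),
			attribute.String("root", root),
		),
		// Relates the spans without creating a parent/child edge.
		trace.WithLinks(trace.Link{SpanContext: messageSpan.SpanContext()}),
	)
}
```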

@hannahhoward (Collaborator) left a review:

See comments

@rvagg (Member, Author) commented Dec 8, 2021

I'm going to have to abandon this for the evening; sorry I didn't get it done by EOD in time for a release, @hannahhoward, but it's really close and the traces are already pretty interesting (see impl/graphsync_test.go for summaries).

The main hang-up now is the variability of traces when observing from the tests. I'm pretty sure it's reasonable variability (one or more events queuing, depending on the run), and I don't think I can control some of these things, but I need to account for that variability in the tests. I already have one test doing this, where I know the variability comes from the network layer triggering multiple error events; I just need to walk through the other variable ones to make sure I understand why, so I'm not ignoring dodgy conditions.

I also think I need to be recording pause events on a span somewhere when they happen, and maybe unpause too.
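
Probably just as events on the existing request span rather than new child spans, something like this (sketch only; recordPauseState and requestSpan are made up, AddEvent is the real OTel call):

```go
import "go.opentelemetry.io/otel/trace"

// recordPauseState is a hypothetical helper that marks pause/unpause as
// point-in-time events on the existing request span, rather than starting
// new child spans for them.
func recordPauseState(requestSpan trace.Span, paused bool) {
	if paused {
		requestSpan.AddEvent("pause")
	} else {
		requestSpan.AddEvent("unpause")
	}
}
```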

The TODOs in the code represent my outstanding tasks.

@hannahhoward (Collaborator) left a review:
Hey, various questions. Again, where the request actually ends is a bit complicated, and some of the calls are in the wrong place. See the issue for reference: #288

responsemanager/server.go: 5 review threads (outdated, resolved)
@rvagg force-pushed the rvagg/tracing branch 4 times, most recently from 86fc465 to 6a28a13, on Dec 9, 2021, 11:12
@rvagg (Member, Author) commented Dec 9, 2021

@hannahhoward I'm slightly stuck on this now. Your response-end work has been helpful, but now I'm running into a lot of variability on the receiver side in some of the tests. Sometimes the spans show up and sometimes they don't. I don't know whether this is because those paths are not executed or because I'm capturing traces before everything's closed down and the traces have ended.

I'll keep on trying to debug this to work out why, but if the answers are obvious to you then give me some hints. I'm suspecting network-related variability that I need to accept, but it could also be synchronization issues and perhaps even actual bugs. So I'd like to figure that out to know whether I need to tighten something up, or just give up on those assertions.

The concerning tests with TODOs in them are:

  • TestPauseResume
  • TestPauseResumeRequest
  • TestGraphsyncRoundTripRequestBudgetRequestor

In CI now, I can see these failing fairly reliably, not even recording a response at all, so I suppose that's a speed thing. And TestPauseResumeViaUpdateOnBlockHook is in the list too. 🤦

There's also an outstanding TODO in TestGraphsyncRoundTripAlternatePersistenceAndNodes where the recorded error on the request trace can either be a SkipMe or a ContextCancelError, which is a little concerning.

@hannahhoward (Collaborator) left a review:
Couple of comments, but this is looking good. I'll take a look at the testing issues now.

responsemanager/client.go: review thread (resolved)
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
response.span.RecordError(err)
response.span.SetStatus(codes.Error, err.Error())
@hannahhoward (Collaborator):
I dunno what good span error conventions are but should we record this twice at two different levels?

@rvagg (Member, Author):
Yeah, I don't know and wanted to discuss this with you. This particular one is a bit tricky; others are a little clearer, I think, in that "this error belongs to the response", but here there's a span for something called "abort", and as a user I'd expect to see what that abort was about on that span, while the error is also responsible for closing the request.
I could go either way on this, and threw up my hands and said why not both? 🌮

@hannahhoward (Collaborator):
all good by me :)

@rvagg force-pushed the rvagg/tracing branch 4 times, most recently from 218092a to b219119, on Dec 10, 2021, 05:39
@rvagg marked this pull request as ready for review on Dec 10, 2021, 05:40
@rvagg (Member, Author) commented Dec 10, 2021

I think this is in a decent state now for a proper review @hannahhoward when you have time.

@rvagg changed the title from "feat: add basic tracing for responses (WIP)" to "feat: add basic tracing for responses" on Dec 10, 2021
@hannahhoward (Collaborator) left a review:
A few small nits to consider fixing before merge, but not blocking enough I want to force a request changes cycle :)

attribute.Int("id", int(update.ID())),
attribute.Int("priority", int(update.Priority())),
attribute.String("root", update.Root().String()),
attribute.Bool("isCancel", update.IsCancel()),
@hannahhoward (Collaborator):

Are these needed? Getting to this code path requires that isCancel is always false and isUpdate is always true... it seems unnecessary to record a value that is always the same.

@hannahhoward (Collaborator):

Also, all of these values EXCEPT the extensions should be contained in the parent span. Basically, an update request ONLY contains extensions to update the request, plus an id to identify the request (plus isUpdate to identify that it's an update).

This is actually a pretty confusing part of the current protobuf format:

new, cancel, and update share a struct type: Request

However, Cancel is basically:

{
id: id,
isCancel: true
}

and update is:

{
id: id,
isUpdate: true,
extensions: extensions
}

All the other fields are only set for new requests.
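
So, just to illustrate the shape of it (a made-up helper, not code from this repo; only the IsCancel()/IsUpdate() accessors it relies on come from the request type this PR already touches):

```go
// requestKind captures only the accessors this example needs; the real type in
// this PR is the message package's request struct.
type requestKind interface {
	IsCancel() bool
	IsUpdate() bool
}

// classify shows how one Request struct carries three different meanings:
// cancels set only the id + isCancel, updates set only the id + isUpdate +
// extensions, and everything else is a new request with all fields populated.
func classify(req requestKind) string {
	switch {
	case req.IsCancel():
		return "cancel"
	case req.IsUpdate():
		return "update"
	default:
		return "new"
	}
}
```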

@hannahhoward (Collaborator):

Only mentioning in case we decide we want to make this clearer, given that we're about to do big protocol changes.

@rvagg (Member, Author):

OK, all very good to know. What I've done: removed all but "extensions" from the processUpdate span, also removed "isCancel" and "isUpdate" from the request span.

It looks like we can assume an "isUpdate" if we get a "processUpdate" span, but I don't know if we can assume an "isCancel" as clearly since abortRequest() can be called from both CloseWithNetworkError() and CancelResponse() as well as from the plain processRequests() path. I think that an ipldutil.ContextCancelError in the error of the span should differentiate the different paths and let you assume that it's an "isCancel".

But also, maybe this is a task for later when we get some tracing closer up to the message layer where we can lay out more specifics about the chatter.

responsemanager/server.go: review thread (outdated, resolved)
@hannahhoward merged commit 02173d5 into main on Dec 11, 2021
hannahhoward added a commit that referenced this pull request Dec 11, 2021
fix(impl): remove accidental legacy field (#310)

feat: add basic tracing for responses (#291)

* feat: add basic tracing for responses

* fix: add assertions for response completion

* fix: improve test synchronization/drainage

* fix(responsemanager): remove uninformative span flags

Co-authored-by: hannahhoward <hannah@hannahhoward.net>
@rvagg deleted the rvagg/tracing branch on Dec 13, 2021, 02:44
@rvagg (Member, Author) commented Dec 13, 2021

oh, you already did it! didn't get to the bottom of this issue before I tried to push some changes.
All good then

marten-seemann pushed a commit that referenced this pull request Mar 2, 2023
* feat(channels): add missing cids field

* feat(graphsync): handle missings cids via events, integration test

* style(lint): mod tidy

* feat(channels): put retrievals with missing CIDs in "partially complete"

* style(graphsync): add explanatory comment

* feat(deps): update to v0.12.0 graphsync

* ci(circle): update to go 1.16

* build(go.mod): update to go 1.16
marten-seemann pushed a commit that referenced this pull request Mar 2, 2023
…294)

This reverts commit b442027.
(still contains tracing improvements and module dep updates)

Co-authored-by: Jiaying Wang <42981373+jennijuju@users.noreply.github.com>