Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Views not receiving the traceparent from the event #2041

Open
franciscolopezsancho opened this issue Feb 21, 2024 · 11 comments
Open

bug: Views not receiving the traceparent from the event #2041

franciscolopezsancho opened this issue Feb 21, 2024 · 11 comments

Comments

@franciscolopezsancho
Copy link
Contributor

Screenshot 2024-02-21 at 11 40 13

The View are using the same traceparent from the runtime.

@franciscolopezsancho
Copy link
Contributor Author

franciscolopezsancho commented Feb 21, 2024

Here's what I've found so far:
The flow of data is the following:
The runtime gets the HTTP message, creates the span in Serve, add the traces in ( val grpcProxy = createGrpcApi(components, router, discoveryClient, traceInstrumentation, settings) ) so at that point there's only the traparent from the runtime. It handles the command to the EventSourcedEntity with the runtime traceparent. This sends the request through the stream to the UF and in the SDK we then create another span when runEntity (the flow in the other side) receives the message. Here we set the metadata with the new traceparent, from the new span we pass it's metadata to router._internalHandleCommand that passes it to the class with .invoke (in the SpringSDK) and then we are back to runtime/EventSourcedEntity.receiveCommand but now there is a command (it's a var) in the entity and we move to waitingForUserFunction (not idle as before).
Apart from that EventSourcedEntity.receiveEventSourcedStreamOutEvents uses the command above (from the very same EventSourcedEntity) and the event from the SDK and here's the problem. We need the traceparent from that event but we have a Any not a TraceableAny. So the change can be humongous.

@franciscolopezsancho
Copy link
Contributor Author

My understanding is that we need to change EventSourcedReply in event_source_entity.proto from Any to TraceableAny.
Actually ESE reply and maybe something similar in VE.

@johanandren
Copy link
Contributor

That shown trace looks like the traceparent for the call is propagated, just not the one from the event sourced entity span but the span you start on receiving the call in the grpc/http logic of the runtime.

Doesn't that indicate that you are just using the wrong metadata/traceparent when you create the TraceableAny before persisting it?

@franciscolopezsancho
Copy link
Contributor Author

franciscolopezsancho commented Feb 22, 2024

After some more discussion in the standup. These are the key points as I remember them.

On the one hand we can add the trace info from the SDK and send it to the Runtime:
Pros:
we'll keep the current fine grained trace apart from fixing this issue. That is, we'll get the proper trace parent for Views and also for Actions when consuming events from entities and pushing to a broker.
Cons:
Maybe that's too much fine grained
We have to check if the SDK is sending that info inside the event or even make it compulsory. The latter will add a constraint when building new SDKs.

Another option is to avoid creating own SDK span and only show to the user one big span from when the request hits the proxy until after processing it sends it to the user.
Pros:
Simpler to implement, Not tying the SDK to have to send info back to the Runtime. In any case we need keep sending trace parents from the Runtime to the SDK to allow to start new spans to the user.
Cons:
We lose visibility where do we spend our processing time.

@aludwiko
Copy link
Contributor

Ok, this is the current state:
image

I agree that the user function VE span will be always (?) very small, it should be at least, so it's not very useful information. If we put View and Action (with subscription) under the Value Entity span, then we lose the information that this is an asynchronous process. Here, we know that the request took 74ms, the rest is under the same trace, but it's not a part of the request.

How this would look like in the alternative approach?

@franciscolopezsancho
Copy link
Contributor Author

What I expect when "merging" runtime with the SDK component span, VE will be in the same vertical than Kalix endpoint while Views and Action would stay where they are.
I write "merging" because we'll get rid of the VE (in this case) and only show the Kalix endpoint although with the title VE in.

I think this trace shows very interesting info for us. All the latency we introduce before the actual call to the UF. On the other hand we might not want to show this to the users for the very same reason.

@aludwiko
Copy link
Contributor

I'm not sure if merging kalix endpoint with VE span is a good idea. If we are ok with what we have at this point, I would postpone this decision, collect some feedback, and then think about this again.

@aludwiko
Copy link
Contributor

Somehow related to this discussion, the runtime will retry in the case of a subscription failure, and a new span will be created. Is that what we want to show?
image

@franciscolopezsancho
Copy link
Contributor Author

I'm not sure if merging kalix endpoint with VE span is a good idea. If we are ok with what we have at this point, I would postpone this decision, collect some feedback, and then think about this again.

I definitely would postpone the decision. Agree, let's go what we have now and then we can review this and other issues. What we have is just the very first version.

@franciscolopezsancho
Copy link
Contributor Author

Is that what we want to show?

It doesn't look great although it's interesting to see the back off increasing and decreasing forming the two curves. To have one span with the info of all the retries would be nice but I can think in two scenarios:

  1. All actions fail
    • This could show one span with error and the number of retries somewhere
  2. At the end one action succeeds -
    • This could show the above and then one more span with the successful action.

I don't know if this is technically possible though and again not even sure if the way it is it's just fine.

@efgpinto
Copy link
Member

Somehow related to this discussion, the runtime will retry in the case of a subscription failure, and a new span will be created. Is that what we want to show?

IMO, this looks fine and useful to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants