Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please (re)-allow recording links after Span creation time #454

Closed
eyjohn opened this issue Feb 11, 2020 · 37 comments · Fixed by #3678
Closed

Please (re)-allow recording links after Span creation time #454

eyjohn opened this issue Feb 11, 2020 · 37 comments · Fixed by #3678
Assignees
Labels
area:api Cross language API specification issue area:sampling Related to trace sampling area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory

Comments

@eyjohn
Copy link

eyjohn commented Feb 11, 2020

It seems that during #258, Adding links after Span creation was removed due to complexities around samplers. While I do appreciate the complexity, I wanted to highlight a scenario where this feature is needed and suggest a potential solution for how this can be accommodated, partially addressing the sampling decision issue.

Example Scenario

Consider a system that waits for several events and processes them as a batch, for example for a subscription in a real-time system. These are stored in a Queue and each event has its OWN injected SpanContext. These are then consumed in a single flush() Operations.

First event arrives

Queue=[E1(SpanContext1)] and a flush() is scheduled for the Subscription

Several events arrive in the meantime

Queue=[E1(SpanContext1),E2(SpanContext2),E3(SpanContext3)]

flush() is invoked

A Span gets created for the flushing of the events. In this case, I consider that the first event triggered the flush and hence makes sense to be the parent. So I might do something like:

message = queue.pop();
span = tracer.CreateSpan(name="flush", parent=message.span_context)
do {
  do_something(message);
  span.AddLink(message.span_context); // This might not be needed since the first is also parent
  if ( /* early escape condition */ ) { // Lets say downstream is full
    if (!queue.empty()) {
      enqueueNextFlush(); // still messages to flush
    }
    break; 
  }
} while ( message = queue.pop() );

Let's say the escape condition was triggered after the second message... I expect the following Spans:
Span1(name="flush", parent="SpanContext1", links=["SpanContext1", "SpanContext2"])
Span2(name="flush", parent="SpanContext3", links=["SpanContext3"])

Additional Notes:

In this case, let's say that our events (E1, E2) are "compressed/conflated" and it no longer makes sense to track them independently.

Also, it is not desired for the queue to be consumed ahead of processing (and for that matter, the escape condition may not yet be known until the processing).

Possible Solutions for the sampling decision problem

Consider only the parent as guaranteed at creation.

The parent is necessary for traceId propagation already, and inheriting the default sampling decision would normally make sense purely from the parent. We can, by convention, recommend that sampling decisions should only consider this link in the sampling decision, although nothing will prevent implementations from using links if they are available in the API.

Allow late links with some limitations.

The limitations can be clearly documented, such as the inability to consider them for sampling decisions and any future use case that is affected by this behaviour.

Consider mutable sampling decision

This suggestion goes way beyond the scope of this feature, but its worth considering its impact on this requirement.

Consider the ability for tracers to "reconsider" the sampling decision as Span events (in this case any change to the span, e.g. operation_name, attributes, events, links are changed). This ultimately creates further complexities with propagation which may enforce constraints such as changes to sampling properties only takes effect from that point onwards. I'm happy to explore this further in a separate issue.

@bogdandrutu bogdandrutu added area:api Cross language API specification issue area:sampling Related to trace sampling spec:trace Related to the specification/trace directory labels Jun 26, 2020
@carlosalberto carlosalberto added the release:after-ga Not required before GA release, and not going to work on before GA label Jul 2, 2020
@Oberon00 Oberon00 added the area:span-relationships Related to span relationships label Aug 19, 2020
@pyohannes
Copy link
Contributor

In today's messaging workgroup, we came across scenarios where we would need to add span links after span creation. The scenario is the following:

We have a "Receive" span that receives several messages from a broker and blocks, and we want this "Receive" span to link to the context of all received messages. We cannot delay the span creation, we want to have the "Receive" span as active span, so that other operations that happen during receive (e. g. HTTP requests spans) can be correctly parented to it. So, we need to start the "Receive" span, and then add links to messages as we receive those.

This is not possible with the current model where links can only be added at span creation. Therefore, we'd like to propose to allow creating links once the span is created ("Consider only the parent as guaranteed at creation" from the above-mentioned possible solutions).

Preventing adding links after span creation removes some complexities around link-based sampling, but on the other hand makes links less useful and (as in our case) would result in less intuitive traces. How do people feel about this trade-off?

@tedsuo
Copy link
Contributor

tedsuo commented Jan 13, 2022

+1

The more we model with links, the more we see the need to add links at any time during a span's lifecycle. In this regard, linking very similar to adding child spans.

I understand that links have a SampledFlag which could be useful for sampling, much in the same way that Span Name can be useful. But the practical reality is that the links are not always present when a span begins.

(As an aside, it is is not actually clear how up-front sampling should make use of the SampledFlag contained in links. It seems like the more links you add, the more the span sampling skews towards "always sample" or "never sample.")

@pyohannes
Copy link
Contributor

Discussed in today's spec meeting.

We agreed to allow creating links after span creation, a PR will be submitted.

@eyjohn
Copy link
Author

eyjohn commented Jan 19, 2022

Great to hear that this got some motion! looking forward to viewing the PR

@bogdandrutu
Copy link
Member

We cannot delay the span creation, we want to have the "Receive" span as active span, so that other operations that happen during receive (e. g. HTTP requests spans) can be correctly parented to it. So, we need to start the "Receive" span, and then add links to messages as we receive those.

Very interesting example, but if an HTTP request happens before knowing what "linked spans" are you processed, it means that the "HTTP request" is not conceptually part of that Span because you made it without needing any information from the parents? I have a hard time understanding what meaningful work can you do before knowing the "input" of the work.

@pyohannes I missed that part of the meeting, can you summarize what was the discussion?

Preventing adding links after span creation removes some complexities around link-based sampling, but on the other hand makes links less useful and (as in our case) would result in less intuitive traces. How do people feel about this trade-off?

I don't think it removes "complexity", it actually causes overhead because you need to always record everything since you never know that at any point links may come.

@bogdandrutu
Copy link
Member

Consider the ability for tracers to "reconsider" the sampling decision as Span events (in this case any change to the span, e.g. operation_name, attributes, events, links are changed). This ultimately creates further complexities with propagation which may enforce constraints such as changes to sampling properties only takes effect from that point onwards. I'm happy to explore this further in a separate issue.

I think it is important to be considered from the beginning, since without this things will no longer work as expected.

@yurishkuro
Copy link
Member

I've read both examples and and the proposed modeling doesn't make sense to me. I think we should be more prescriptive about what it means to have a link. I don't think it's a good idea to use them for just random associations, links should be used to capture causality. In the second example, if the processing span starts before receiving subsequent events, it means that there is causality between starting that span and the following event (but quite possible there is causality with its children where the links cab be recorded on span creation).

@pyohannes
Copy link
Contributor

Very interesting example, but if an HTTP request happens before knowing what "linked spans" are you processed, it means that the "HTTP request" is not conceptually part of that Span because you made it without needing any information from the parents? I have a hard time understanding what meaningful work can you do before knowing the "input" of the work.

For messaging, there's a use-case when a consumer initiates a "Receive" operation, during which several messages are received and returned. We'd like to have a single "Receive" span covering this operation, which links to all spans that were received (or cached) and returned. This is what people want for many messaging scenarios, without being overwhelmed by details of the underlying transport (AMQP or MQTT related spans).

The context information we want to use for linking will be passed as part of the message (not via standard context propagation mechanisms). Thus, we will receive a message (via AMQP, HTTP, MQTT, or whatever), then extract context information from the message metadata, and only then we will be able to create a link. This will allow us to link spans on the consumer side to spans on the producer side (regardless of what the broker does in the middle). This is already described in a draft OTEP in more detail.

I don't think it removes "complexity", it actually causes overhead because you need to always record everything since you never know that at any point links may come.

You're right there, it adds some complexities to sampling, but it removes complexities around writing instrumentation. For our messaging scenario, with the current limitations we would need to add one or more "dummy" spans to the "Receive" span, so we can add the links to producer spans.

@pyohannes I missed that part of the meeting, can you summarize what was the discussion?

During the discussion there were no objections to removing the restriction. @tedsuo was in favor of it, @jmacd remarked that it doesn't add any complexities to sampling that aren't already present with attributes (which also can be added after span creation).

@tedsuo
Copy link
Contributor

tedsuo commented Jan 20, 2022

At this point, I think it would be helpful for the messaging SIG to produce a clear presentation of our proposed model, and have the TC/spec group review it.

Presentation should include:

  • Written description of the model (slides or document)
  • Clear diagrams describing the common scenarios, as well as the edge cases, which the model attempts to address
  • A prototype of the proposed model which shows these same scenarios in code, and also identifies any points where the current version Otel creates a roadblock – requiring either a spec or a model revision.

@pyohannes has generously offered to put this presentation together. @bogdandrutu @yurishkuro when it's ready we'll try to schedule a time to present it to the TC (ideally at the spec meeting, if you both can attend).

@eyjohn
Copy link
Author

eyjohn commented Jan 21, 2022

@yurishkuro

links should be used to capture causality

I agree that causality is ultimately the right thing to capture in links/parent however this is not a simple concept and the examples above aren't just related but arguably causal.

To avoid writing a book in a github post, I'll try to be brief in summarizing my views on the different types of causality covered here.

Most of the traditional examples for tracing, model what I'll refer to as "direct immediate causality", i.e. Y happed directly as a result of X, and generally, Y happened immediately after X, this is often where we just use the "parent" relationship.

An intermediate use-case where a batch of events is processed together would create an "immediate indirect causality", indirect referring to the fact that it's not clear which of the events X1..n ultimately caused the batch processing Y, it's not clear what to use as a parent or links, perhaps the first in the batch is a parent? or if we have a poll-based system for processing the batch then the instigator of the poll trigger. This is well supported today as everything is known at the beginning of Y.

The cases defined here exhibit a slightly different characteristic, that of an "indirect" (unclear) causality. Unfortunately, both examples don't go into enough detail to demonstrate this, but the implication is that:

  • Y is immediately invoked as a result of X1
  • X2..n arrives during the processing Y
  • The work done in Y is different as a result of X2..n
  • The handling of Y is different than the composition of all sub-parts of Y (i.e. we can't just model Y1..n as children of X1..n)

It is the last point that really introduces the problem that X2..n didn't really "cause" Y to happen, but they DID change the meaning of Y in some way (probably in a lossy transformation of some sort).

I can produce further examples where this pattern happens, but most of them are in event-based systems with streaming mechanics where the implementation of Y != Y1..n is often a performance requirement.

@jmacd
Copy link
Contributor

jmacd commented Feb 15, 2022

I have two positions related to recording span links after creation time:

  1. Let's just call them Events. Events have a name and key:value attributes, I propose the name "spanlink" with attributes "span_id", "trace_id", "tracestate", "sampled". There's no new API created here, no tough questions about sampling, just an equivalent way to record Span linkage.
  2. Let's recognize that "Linkage" is a bi-directional concept. For every link, there's an opposite link that could have been reported in the span that existed first. The "antecedent" span, because it exists first, needs a way to create a link to its subsequent relative. Doesn't it?

@eyjohn
Copy link
Author

eyjohn commented Feb 15, 2022

Thanks for the input @jmacd

  1. Let's just call them Events

While we can arbitrarily use either events (or even attributes) to represent links, (which in essence makes links simply a semantic convention), I fear this detaches some of the semantics of what a link is, and makes it harder for tools to use links for anything meaningful.

An example use-case that would be impacted by this would be the ability for a tool to provide some ability to traverse traces by links, perhaps with an optional filter on attributes of links. While this is still plausible with your suggestion, this puts more burden on tools to detect such links.

  1. Let's recognize that "Linkage" is a bi-directional concept

This works well for some scenarios, such as the "in-queue-with" example (sorry don't have the link to which conversation first mentioned it), where both spans remain alive and the link can be added in both directions. However, in practice, there will be plenty of cases where only the SpanContext is known so a bi-directional relationship is unfeasible to set.

Having said that, for tools/consumers of trace data to create a bi-directional relationship in tools, and allow traversal in either direction, would indeed be a desirable trait. However, I would not see this as an API/data-model constraint but more about a suggestion for tools/vendors.

@pyohannes
Copy link
Contributor

I think the main question that needs an answer is whether the requirement that all links MUST be present when a sampling decision is made is feasible and should be kept. As far as I understand, imposing the current limitation (to prevent adding links after span creation) was a conscious decision made based on this requirement.

With that requirement in mind, it's probably not beneficial to look too much at single scenarios where adding links after span creation seems to make sense (except for finding workarounds for single scenarios).

Also, both proposals @jmacd made ("weak links" and bi-directional links) are working around the requirement, and probably wouldn't be received favorably by those who mandate for it.

Having the limitation in place maximizes possibilities for future link-based samplers, which is fair. However, it needs to be clear that as a trade-off it limits the modelling capabilities OTel offers for traces.

@anuraaga
Copy link
Contributor

I think the main question that needs an answer is whether the requirement that all links MUST be present when a sampling decision is made is feasible and should be kept.

I think this is something to decide as part of the messaging semantic conventions. If semantically messaging can't be modeled well with sampling decisions that don't take into account all links, then it probably should be prohibited. If this isn't true and on the flip side links after the fact are needed for data modeling, then they should be supported. It seems like a decision messaging workgroup should be allowed to primarily own.

@eyjohn
Copy link
Author

eyjohn commented Feb 21, 2022

I think the main question that needs an answer is whether the requirement that all links MUST be present when a sampling decision is made is feasible and should be kept

A quick scan of the specification has yielded this as the primary reference:
https://github.com/open-telemetry/opentelemetry-specification/blob/573c1c68f409cd6b5dc40a31f24b250a965425a3/specification/trace/sdk.md#sampler

While it doesn't explicitly state that Links can't be changed, it is implied by the usage of the Sampler.

I guess it's pretty clear that this change will violate this statement if we consider the sampling decision to only be considered at the creation of the span.

Consequently, the remaining question is how we want to handle this, and more importantly what are the recommendations.

I would argue that deferring the creation of links would be no different to setting attributes on a span after it was created (and presumably after the first sampling decision at creation could be made).

Now both the deferred modifications of Attributes and Links leave the question of whether this would allow a reconsidering of the sampling decision, and how this would be handled (as well as implications of use/propagation of span context up until that point). We can definitely consider this further but I think it is out of the scope of the current proposed change as it applies equally to attributes.

@nerochiaro
Copy link

Sorry for reopening this discussion after such a long time.
It is not clear to me whether a consensus on the issue has been reached and links should NOT be allowed to be added to an existing span.
Or is the discussion still open and there is a possibility that at some point it will be possible to add them ?

@dyladan
Copy link
Member

dyladan commented Oct 14, 2022

If the issue is still open it is not settled.

I don't think the argument that links can't be added later because it might affect the sampling is convincing to me. It forces the user to create another span after getting a batch of messages even if there is already a span for the operation. A good example of this would be something like a lambda function. If the user is using the lambda layer to create a span for the function (my-function.invoke), there is no way to attach links to that span. This forces the user to create a dummy inner span (my-function.invoke-inner) to attach the links to. Not only does this model make no sense, but it inflates their monitoring bill for any backend which charges per span. The argument that the links are required for the sampling decision on my-function.invoke also falls flat here because they're not available anyway. They are available for my-function.invoke-inner but by then the span has a parent which has already made a sampling decision. If my-function.invoke-inner makes a different decision, you either have an orphan span in your backend or you have to do some magic to revert that sampling decision down the line.

There are also no specified samplers to my knowledge which consider span links

@pyohannes
Copy link
Contributor

Or is the discussion still open and there is a possibility that at some point it will be possible to add them?

As @dyladan mentioned, there has been no final decision neither for nor against adding span links after span creation. Also in my opinion, the enhanced modelling capabilities outweigh the restrictions it would introduce (e. g. for sampling).

@nerochiaro If you could describe a scenario where you have a link for adding span links after span creation and add it to this issue, that would be a good input to the discussion.

@nerochiaro
Copy link

nerochiaro commented Oct 18, 2022

My scenario is the same as the one described by @dyladan (we have been discussing it on slack and they added their latest comment based on that discussion).

At the moment there are no workarounds for this problem. Conceivably the lambda instrumentation could be modified to allow passing a configuration option naming a function that gets run at span creation time to provide links and additional attributes.

However this seems overly complicated, and it would delay both span creation AND, as a consequence, the lambda function execution, until the links are available. Additionally (it's not my case, but just for the sake of argument) what if an I/O bound operation needs to be performed to retrieve the links ?

More generally, it seems very inconsistent to me that the spec allows attributes to participate in sampling decisions and still they can be added after creation. But links, which also participate in sampling decisions, can't be added later.

For attributes the spec clearly says Note that Samplers can only consider information already present during span creation. Any changes done later, including new or changed attributes, cannot change their decisions.

Why can't we just treat links the same way, and add a similar note/warning for when they are added after creation ?

This way the people who care about the alleged sampling problems can make sure links are added at span creation time, exactly like now.
But the people who don't care can have the freedom to add them later. What harm would it do ?

@mmanciop
Copy link

mmanciop commented Nov 7, 2022

I came here to post what the message above says, only difference is that I bumped into it when working on the Python SQS instrumentation.

When we consider that, additionally, the trace messaging conventions allow for only one messaging.message_id instead of an array, I see no way that does not involve really ugly and often semantically incorrect "message processing" child spans (one per message, carrying each one message_id and one span link) to support batch receive without losing trace context.

@lmolkova
Copy link
Contributor

lmolkova commented Nov 23, 2022

another scenario for links after start:

  • messaging producer publishes a batch of messages.
  • as a response it gets a list of receipt numbers/sequence numbers/messaging ids assigned by broker, which can be used to cancel/alter/etc messages from the producer side. recording them would be beneficial. E.g. AWS SQS returns message-id and sequence number, Azure ServiceBus also returns sequence number in response to publish
  • where should instrumentation put them? within current limitations, they have to be put on events (or as attribute arrays on publish span) with some additional logic to correlate individual messages to their receipt numbers.

Proposal:

Let's treat links as events that have type/name, attributes, parent context and another linked context and can be added at any moment after span starts. With new event/log semantics, they can be even added after both spans end - I remember there were scenarios like this in internal Azure instrumentations.

@lmolkova
Copy link
Contributor

Related discussion on sampling and links #2918

@blumamir
Copy link
Member

blumamir commented Feb 7, 2023

Adding another usecase.

When a timer is being registered in the application, we crafted a custom context manager to invoke the timer's callback as a new trace (root context), and store a link on the span entry to the span which registered the timer (so we have the context on why the timer is invoked).

To do that, we had to hook into the span creation and check for the relevant key in the context and create a link automatically when appropriate. There is the OnStart function on the SpanProcessor which is perfect for that, but it is only invoked after the span has already been created, thus we cannot add links at this point. If we could add the link in the span processor, that would simplify our implementation (which is currently patching the startSpan function).

@blumamir
Copy link
Member

blumamir commented Feb 7, 2023

Another example where this might be needed. Some messaging libraries receive messages as a stream. for example nats package in JS uses JS generator in the API to consume messages:

  for await (const m of sub) {
    console.log(`[${sub.getProcessed()}]: ${sc.decode(m.data)}`);
  }
  console.log("subscription closed");

I don't have a specific usecase in mind, but it is an example where the messages might not be available upfront thus instrumentation will need to record them "as they arrive"

@carlosalberto
Copy link
Contributor

Trying to summarize some notes from existing proposals to fix this:

  1. Add a Span.AddLink() operation.
  • Pros: General agreement.
  • Cons: It would break interfaces for a few SIGs, such as C++ or Golang, as adding a new member to an interface is a breaking change.
  1. Add an optional Tracer.RecordLink() operation that records link data (spancontext, attributes) as an Event (similar to RecordException().
  • Pros: This as an optional operation, and this means the Span interface wouldn't break for the previously mentioned SIGs. It could be optionally defined somewhere else (instrumentation or helper package).
  • Cons: We would end up with Links having two different representations (original proto value, event value). Alternatively, we could try to deprecate the current Link proto value, and use Events only, but that's in itself a big discussion/decision we would need to have (and come up with a final decision).
  1. Add the AddLink() operation either as a Span operation, or as a module-level operation such as opentelemetry.api.extra.AddLink(Span, Link).
  • Pros: It would set the foundation for adding operations to interfaces without breaking them (C++, Golang).
  • Cons: It may confuse users, as they won't find Span.AddLink() for some languages, although the best could be said for 2).

@MrAlias
Copy link
Contributor

MrAlias commented Mar 21, 2023

Option 2 has this same limitation:

Cons: It would break interfaces for a few SIGs, such as C++ or Golang, as adding a new member to an interface is a breaking change.

@MrAlias
Copy link
Contributor

MrAlias commented Mar 21, 2023

Are we missing option 4 where adding a link is done via AddEvent?

@MrAlias
Copy link
Contributor

MrAlias commented Mar 21, 2023

Have we completely ruled out setting a link on a span ending?

@jmacd
Copy link
Contributor

jmacd commented Mar 21, 2023

My preference is still the span event route, I would like to read something like:

spanA.Event("spans collide",
    trace.WithSpanLink(spanB),
    trace.WithStackTrace(true),
    trace.WithAttributes(
        attribute.String("prop1", "..."),
        attribute.Int("shard", ...)))

@tedsuo
Copy link
Contributor

tedsuo commented Mar 21, 2023

Please, let us minimize the blast radius to adding this method back in.

Remember, AddLink was originally a method on the Span. We removed it because we hoped that we could limit linking to span start. Turns out we can't. So let's just add it back.

Mixing linking up with SpanEvents would be a terrible idea. Links are a very important part of trace structure. Traces are graphs. Spans are nodes. Links are edges. This is a very important part of the internal structure of our tracing system, and we want to use types to enforce this structure at the data model level.

  • With SpanEvents, there would be no enforcement of structure. It would all be done by convention, using dictionaries.
  • SpanEvents include end user data. We would be mixing up our internal structure with arbitrary data.
  • We would now have links living in two places on our data model, both as Span.Links and hiding in Span.SpanEvents.
  • SpanEvents may well be deprecated in favor of Logs. It is confusing to our users to have both. Logs may not even be sent to a tracing backend, they might be tee'd off to a separate logging system. This is a terrible place to put data that is critical to the structure of our traces.

Now, if we are not in fact saying that Links should be SpanEvents, but instead saying that Links will still be completely separate from SpanEvents in our data model, we just want to co-opt the SpanEvents API because it feels convenient – that is terrible API design. APIs should not be "found art," where you overload one facility because it happens to take a dictionary as an argument. We are trying to make a standard tracing API that is going to be spread everywhere and last for years. Our users deserve for that API to be built with clarity and intention.

Please, please do the normal, ordinary, clear, concise thing. Add the SpanLink method back and be done with this.

@jmacd
Copy link
Contributor

jmacd commented Mar 21, 2023

@tedsuo I just want to make sure the example I gave is well understood.

With SpanEvents, there would be no enforcement of structure. It would all be done by convention, using dictionaries.

I would suggest that if the span event contained one of these trace.WithSpanLink() options, the result would be to create a two links (one per span) and one Event. The Links would have enforced structure for both the linking-to and linking-from span.

SpanEvents include end user data. We would be mixing up our internal structure with arbitrary data.

The event would have the user data ("spans colliding", "prop1", "shard", stacktrace). I haven't described how to correlate the span link and the event.

Logs may not even be sent to a tracing backend, they might be tee'd off to a separate logging system.

I call this an implementation detail. The trace SDK can intercept the stream of logs and put them back into the span stream if it wants.

It would all be done by convention, using dictionaries.

We have to accept conventions in this group, it's a lot of what we do. We have lots of ways to correctly identify information, and in this case I suggest the use of instrumentation scope names and schema identifiers. In this future world where logging replaces span events, we're going to have to be able to recognize a structured object and check whether it meets the schema. I would advocate for JSON schema so that we log things, by convention, with JSON objects including dictionaries, lists, numbers, etc and interpret them correctly.

@lmolkova
Copy link
Contributor

lmolkova commented Mar 21, 2023

Remember, AddLink was originally a method on the Span. We removed it because we hoped that we could limit linking to span start. Turns out we can't. So let's just add it back.

Mixing linking up with SpanEvents would be a terrible idea. Links are a very important part of trace structure. Traces are graphs. Spans are nodes. Links are edges.

There is a discussion in
#3240 on links vs events (in over-the-wire). The arguments for event(log)-like thing to carry them:

  • decouple links from span lifetime - there is no reason to require an instance of span to record an edge between spans
  • timestamp is useful
  • attributes on links are describing things like messages in a batch rather than relationships between spans.

Essentially, remote context is an edge, but links are bigger than that currently.
I just added a proposal here #3240 (comment) to limit link definition to the remote context(s) and to allow using it on log records.

I.e. we can have a specialized log-record/event that carries a link and API would use link terminology.

The essential part (however we do links): allow recording them whenever they happen, even after span ends

Update. Having said that, I'm happy to postpone allowing to record links after both spans end for any long period of time. Simple AddLink on Span will unblock messaging SIG and would cover 90% of existing cases and we can revisit links vs events discussion after.

@tedsuo
Copy link
Contributor

tedsuo commented Mar 22, 2023

@lmolkova a big problem with decoupling links from spans and putting them in the log stream is that the log stream may not be sent to the tracing system.

If we are going to blue sky, I'm not against some kind of v2.0 protocol where traces are broken down into a stream of events. Span start and span end could also be events. In general, telemetry pipelines would become more efficient with an event-based streaming model. Long running spans that measure things like the life span of a process would be more feasible. Plenty of benefits. Also plenty of problems.

But either way, event-based tracing is not our current model. Currently, we have a trace, log, and metric signal. And the trace signal only contains spans and resources.

So, in our current model, mixing trace structure in with the log signal creates a lot of practical issues.

We could decouple links from spans by adding links directly to the trace stream. But in my opinion, that is blowing up the current issue far beyond what is necessary to solve our current problem. Namely, we want to add links to spans, but currently we lack a method to do so.

@lmolkova
Copy link
Contributor

@tedsuo I agree that there are problems with existing tooling if links after span ends are supported. And I agree as I mentioned above, that we can address it later and 90% of cases don't need it. So I'm all in for adding AddLink right away.

What I'm trying to say is that the rest of the scenarios won't disappear because we don't have tooling and we might need to solve this in a long run. I do think the current over-the-wire link format is problematic in the long run.

@MadVikingGod
Copy link
Contributor

I would suggest that if the span event contained one of these trace.WithSpanLink() options, the result would be to create a two links (one per span) and one Event. The Links would have enforced structure for both the linking-to and linking-from span.

How would that actually work? The span you are linking to might be in a different machine, process, or even closed. When the AddLink() is a method on a span, we know there is one active side (the span you call this on) and the identifier of the other end. There would have to be another communication method to represent that two-way link.

no-reply pushed a commit to no-reply/opentelemetry-ruby-contrib that referenced this issue Jul 11, 2023
…emetry#526)

open-telemetry/oteps#220 recommends:

>  For each message it accounts for, the "Deliver" or "Receive" span SHOULD link
to the message's creation context. In addition, if it is possible the creation
context MAY be set as a parent of the "Deliver" or "Receive" span.

But also acknolwedges:

> open-telemetry/opentelemetry-specification#454 To instrument pull-based
"Receive" operations as described in this document, it is necessary to add links
to spans after those spans were created. The reason for this is, that not all
messages are present at the start of a "Receive" operations, so links to related
contexts cannot be added at the start of the span.

Our "Receive" spans do not link to the message's creation context, and indeed it
doesn't seem possible to do so. Likewise for setting the parent.

This tries to do it anyway, and results in:

  - "Receive" span links remain unset;
  - "Receive" span is a root span for a new trace;
  - "Process" span trace_id is correctly set to the "Send" context;
  - "Process" span's parent is incorrectly set to "Send" instead of "Receive".
no-reply pushed a commit to no-reply/opentelemetry-ruby-contrib that referenced this issue Jul 11, 2023
…emetry#526)

open-telemetry/oteps#220 recommends:

>  For each message it accounts for, the "Deliver" or "Receive" span SHOULD link
to the message's creation context. In addition, if it is possible the creation
context MAY be set as a parent of the "Deliver" or "Receive" span.

But also acknolwedges:

> open-telemetry/opentelemetry-specification#454 To instrument pull-based
"Receive" operations as described in this document, it is necessary to add links
to spans after those spans were created. The reason for this is, that not all
messages are present at the start of a "Receive" operations, so links to related
contexts cannot be added at the start of the span.

Our "Receive" spans do not link to the message's creation context, and indeed it
doesn't seem possible to do so. Likewise for setting the parent.

In lieu of a specification layer solution, this proposes adding a configuration
flag so that users can opt-in to setting the "Send" context as parent for the
"Process" span. This comes at the cost of a continuous trace including the
"Send"-"Receive"-"Process" spans, with "Receive" orphaned from both "Send" and
"Process".

This doesn't attempt to reproduce the "none" `propagation_style` included in
`sidekiq` instrumentation and other similar gems. If this is an interesting
direction, we could consider implementing it.
@atshaw43
Copy link

Chiming in,

We need to create span links after span creation for receive spans for SQS receive message. But spans assume that span links are available at span creation time.

I am thinking of creating a facade span for downstream propagation and then creating the real span later once we have the span links. I will try to have the facade span ended without emitting it.

This seems unorthodox, but I don't see a better solution given the spec.

carlosalberto added a commit that referenced this issue Oct 10, 2023
Fixes #454 

Related to #3337 

As the Messaging SIG merged its last OTEP 222, we will be adding operations
that require Links after Span creation, taking a direct route with `AddLink()`,
albeit without any of the new members suggested in #3337, e.g. `timestamp` (to be
discussed in a separate issue).

```
AddLink(spanContext, attributes /* optional */)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue area:sampling Related to trace sampling area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory
Projects
Status: V1 - Stable Semantics