Incomplete recent recent traces with elasticsearch backend #2231

Open
Stono opened this issue May 7, 2020 · 4 comments
Labels
help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@Stono

Stono commented May 7, 2020

Hey,
This isn't a problem with Jaeger per se, more of an issue when you're using Jaeger with the Elasticsearch backend.

Because spans come in asynchronously (and batched from collectors), and then the index is refreshed on an interval (say, 10s), it's quite typical for the most recent traces to appear broken:

Screenshot 2020-05-07 at 22 43 39

Then when you refresh 10-15s later, it's all nice and complete:

Screenshot 2020-05-07 at 22 43 47

This leads to an unbelievable number of "tracing is broken" comments to the platform team, so I'm trying to think about ways to handle it.

There is no requirement for "near real time" traces, so ideally I'm trying to find a way to ensure that traces don't become visible in search until, say, 60s after the first span has come in (giving time for all spans to come in).

The only way I can think of doing this would be if Elasticsearch exposed a time offset configuration for this storage backend, one that effectively never returned results newer than 1 minute old when using the Last Hour time default.

Or perhaps have a config option to suppress results which are "trace-without-root-span"?
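To make the "trace-without-root-span" idea concrete, here is a minimal sketch of such a filter. It is not Jaeger code: `Span`, `has_root_span`, and `drop_rootless` are hypothetical names, and it assumes a simplified model in which a root span is simply a span with no parent ID.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    # Hypothetical, simplified span model (not Jaeger's actual data model).
    span_id: str
    parent_id: Optional[str]  # None marks a root span in this sketch


def has_root_span(trace: List[Span]) -> bool:
    """Return True if at least one span in the trace has no parent."""
    return any(span.parent_id is None for span in trace)


def drop_rootless(traces: List[List[Span]]) -> List[List[Span]]:
    """Keep only traces that already contain a root span."""
    return [trace for trace in traces if has_root_span(trace)]
```

Note that a trace can contain its root span and still be missing spans further down, so a filter like this would only catch some of the incomplete results.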

I'm open to ideas!

@ghost ghost added the needs-triage label May 7, 2020
@yurishkuro
Member

This was raised a few times indeed. I don't think it can be solved purely by limiting the search to [now-lookback, now-1min], because once any span falls into this range the whole trace will be returned, even if it's still in progress. An alternative might be implementing a "window" parameter in the query service so that after the search is executed and traces are loaded in memory, those that have any spans within [now-window, now] are discarded as potentially incomplete. They could also be marked as potentially incomplete in the UI.

We're currently working on tail-based sampling implementation that requires pre-aggregation of spans in memory, which naturally introduces a window of inactivity before the whole trace is saved & indexed, so that would also address this problem.
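As a rough illustration of the "window" idea described above, the sketch below discards any loaded trace that has a span inside [now-window, now]. It is only a sketch under stated assumptions: `Span`, `is_settled`, and `drop_recent` are hypothetical names rather than query service APIs, timestamps are assumed to be microseconds since the epoch, and a span is treated as "within the window" if it ends at or after `now - window`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    # Hypothetical, simplified span model (not Jaeger's actual data model).
    span_id: str
    start_us: int     # start time, microseconds since epoch (assumed unit)
    duration_us: int  # span duration in microseconds


def is_settled(trace: List[Span], now_us: int, window_us: int) -> bool:
    """True if every span ended before the start of the window."""
    cutoff = now_us - window_us
    return all(span.start_us + span.duration_us < cutoff for span in trace)


def drop_recent(traces: List[List[Span]], now_us: int, window_us: int) -> List[List[Span]]:
    """Discard traces that have any span inside [now - window, now]."""
    return [t for t in traces if is_settled(t, now_us, window_us)]
```

The same check could instead set a "potentially incomplete" flag on the trace, if marking it in the UI is preferred over hiding it.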

@yurishkuro yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement and removed needs-triage labels May 7, 2020
@pavolloffay
Member

This is related to jaegertracing/jaeger-ui#462. The UI could signal missing spans in certain situations (e.g. missing spans in the middle) where it's clear that those spans haven't yet been reported/persisted.

If there are missing spans, the UI reports that it is skipping clock skew adjustment, which is confusing for users.
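One way a UI might detect the "missing spans in the middle" case is to look for parent IDs that are referenced but not present in the loaded trace. The sketch below is hypothetical (`Span` and `missing_parent_ids` are illustrative names, not Jaeger UI code), and it cannot tell whether a missing parent is still in flight or was never reported at all.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Span:
    # Hypothetical, simplified span model (not Jaeger's actual data model).
    span_id: str
    parent_id: Optional[str]  # None marks a root span in this sketch


def missing_parent_ids(trace: List[Span]) -> Set[str]:
    """Return parent IDs that spans refer to but that are absent from the trace."""
    present = {span.span_id for span in trace}
    return {
        span.parent_id
        for span in trace
        if span.parent_id is not None and span.parent_id not in present
    }
```

A non-empty result could drive a more prominent "this trace may be incomplete, try again shortly" notice in place of a bare clock skew warning.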

@Stono
Author

Stono commented May 11, 2020

Yeah, this could even be a UI concern: make it very prominent that we're displaying an incomplete trace. Anything to save the user confusion!

@badihi

badihi commented May 6, 2023

I was investigating these "invalid parent span IDs, skipping clock skew adjustment" errors for almost a week! Now I've found out that I just have to wait for all of the spans to be collected.
Maybe I'm stupid! But stupid people are not rare nowadays! There must be some instructions to prevent the confusion in the UI.

4 participants