
Allow to limit number of spans per trace for UI #2496

Open
timothybasanov opened this issue Apr 12, 2019 · 11 comments

@timothybasanov

Feature:
An additional configuration parameter to limit the number of spans within a single trace.
It could potentially be applied to search results and/or in-memory storage.

Rationale:
Accidentally running a big fan-out job may create a trace with a million spans. This affects users in several ways:

  • The Zipkin UI throws JavaScript stack overflow exceptions when there are too many spans to render
    • Even when it does not throw, browsers tend to hang when dealing with that much data
  • zipkin2.server.internal.ZipkinQueryApiV2#writeTraces has an integer overflow when the resulting JSON is more than 2 GB in size (see the sketch after this list)
    • The user does not know which trace is the big one, but as soon as it appears in search results the server returns 500 errors, so there is no way to exclude the one bad trace from the results
    • Non-200 errors are not rendered in the Zipkin UI, so the user has no way of knowing what went wrong
  • In-memory storage behaves very poorly, as it iterates over all spans within a single trace
    • This disproportionately affects load-testing jobs on a dev machine
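To make the overflow failure mode concrete, here is a minimal, self-contained Java sketch (illustrative only, not the actual Zipkin code) of how an `int` byte counter silently wraps once a serialized payload passes 2 GiB:

```java
public class IntOverflowDemo {
  public static void main(String[] args) {
    // Suppose the serialized JSON size is tracked in an int, added up span by span.
    int totalBytes = Integer.MAX_VALUE; // roughly 2 GiB already accumulated
    int nextSpanBytes = 1024;

    totalBytes += nextSpanBytes; // silently wraps to a large negative number

    System.out.println(totalBytes); // prints -2147482625
    // Any buffer size or Content-Length derived from this value is now invalid,
    // which surfaces to the user as an opaque 500 from the server.
  }
}
```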

Proposed solution:

  • An additional configuration parameter that limits the total number of spans per trace that the Zipkin API sends to the UI (see the sketch after this list)
    • Dropping some data is better than failing the whole search result
    • Ideally the end-user UI should show that some spans were dropped and the view is incomplete
      • This is hard to implement; a temporary solution: after dropping some spans, add one span back that says "error: too many spans within a trace, had to drop some on the floor"
  • An additional configuration parameter that limits the total number of spans within a trace for in-memory storage
    • Dropping some data may be better than bringing down the whole server when it either runs out of memory or gets stuck in an endless GC loop
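A hedged sketch of the first bullet, assuming a hypothetical `maxSpansPerTrace` setting (none of these names exist in Zipkin today): truncate the spans returned for one trace and append a synthetic marker span so the UI can at least show that the view is incomplete.

```java
import java.util.ArrayList;
import java.util.List;
import zipkin2.Span;

final class TraceTruncator {
  private final int maxSpansPerTrace; // hypothetical setting, e.g. exposed as an env variable

  TraceTruncator(int maxSpansPerTrace) {
    this.maxSpansPerTrace = maxSpansPerTrace;
  }

  /** Returns the trace unchanged if small enough, otherwise a truncated copy plus a marker span. */
  List<Span> truncate(List<Span> trace) {
    if (trace.size() <= maxSpansPerTrace) return trace;

    // Naive truncation: keeps the first N spans regardless of hierarchy.
    List<Span> result = new ArrayList<>(trace.subList(0, maxSpansPerTrace));
    Span first = trace.get(0);
    result.add(Span.newBuilder()
        .traceId(first.traceId())
        .id("1") // synthetic id; a real implementation would pick a non-colliding one
        .name("spans dropped: trace exceeded limit")
        .putTag("error", "too many spans within a trace, "
            + (trace.size() - maxSpansPerTrace) + " were dropped")
        .build());
    return result;
  }
}
```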
@drolando
Contributor

It'd be nice if we could paginate the response if there are more than, let's say, 10k spans. It might be hard to figure out a nice way to display that, though.

@codefromthecrypt
Member

As these sorts of issues tend to be re-hashes and usually recreate content in various places, can we move parts of the discussion to where some history exists?

We don't have any issue at the moment about mitigating based on span count at ingress (collection). I think this issue adds new discussion on that point, and it probably deserves some thought.

@codefromthecrypt
Member

Some notes about the issue here:

  • In-memory storage is explicitly documented as not for production. We have some other in-memory stuff cooking, e.g. https://github.com/adriancole/zipkin-voltdb
  • Even if in-memory were OK, I don't know how many backends would work with million-span traces.
  • To the best of my knowledge we've not yet discussed span-count-based dropping at the server level.

I do think span-count pruning has been discussed in Brave, but there are issues about how that affects things. The collector side would involve state to track it; this is possible in single-instance scenarios, or where there is some shared state you can look up based on trace ID. I can see why someone would wonder whether we could special-case this in the in-memory collector, as we already have collections by trace ID.
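To make the state requirement concrete, here is a minimal sketch of trace-scoped counting in a single collector instance. It is purely illustrative (the class and its parameter are hypothetical), and it only works because every span for a trace reaches the same process; a clustered deployment would need shared state instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class SpanCountGate {
  private final int maxSpansPerTrace;
  private final ConcurrentHashMap<String, AtomicInteger> countsByTraceId = new ConcurrentHashMap<>();

  SpanCountGate(int maxSpansPerTrace) {
    this.maxSpansPerTrace = maxSpansPerTrace;
  }

  /** Returns false once a trace has already hit the limit, so the caller can drop the span. */
  boolean tryAccept(String traceId) {
    AtomicInteger count = countsByTraceId.computeIfAbsent(traceId, id -> new AtomicInteger());
    return count.incrementAndGet() <= maxSpansPerTrace;
  }
  // Note: a real gate would also need to expire entries (traces never "end" explicitly),
  // which is exactly the extra state-management cost mentioned above.
}
```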

@codefromthecrypt
Member

@timothybasanov P.S. do you mind hopping on Gitter for the load-test-related questions? I want to dig into them without letting this issue wander too much: https://gitter.im/openzipkin/zipkin

@timothybasanov
Author

We can probably close this ticket (and related ones) as duplicates of one big ticket.
I'll try to talk on Gitter in more detail about what I'm doing locally with tests. :)

I think the other tickets are slightly different from this one: they are about a slow UI, while this one is about the UI silently crashing with exceptions (stack overflow and/or browser crash) and the server silently failing with exceptions (integer overflow). I don't expect Zipkin to work well with big traces; I just want it not to crash.

@codefromthecrypt
Member

codefromthecrypt commented Apr 12, 2019 via email

@codefromthecrypt
Member

On this particular point of "too big to load at all": there seem to be two ways to go about it, depending on whether we focus on after the fact (UI) or before (collection):

  • We can identify based on size (easiest; possibly a HEAD request)
  • We can identify based on span count or some other trace-scoped heuristic (engineering required if we want to stop the browser from crashing while reading the JSON)

While the UI is the primary user of the API, it isn't the only user, so changing the API with special hooks is likely not worth doing. The API is also not well suited to quickly retrofitting things like pagination. This issue is about preventing crashes, so we can keep that in mind.

There are three entry points which this limiting concerns:

  • The trace list screen (one trace in the results could be poison)
  • Trace by ID and trace by JSON (the only trace loaded could be poisonously big)

I can think of some remedies, and we have to remember this is different from the other issue, which is about traces that are too big, but not too big to crash.

I think the simple way out would be to just refuse to load when the payload is larger than, say, 5 MiB by default, with a setting one can update. I prefer a size-based limit, as traces can be huge for reasons other than span count, like too many tags or annotations.
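As an illustration only (not an existing Zipkin setting or endpoint behavior), such a size gate could check Content-Length with a HEAD request before downloading the body. This sketch uses Java's built-in HttpClient and assumes the trace endpoint advertises Content-Length, which may not hold for chunked responses:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class TraceSizeGuard {
  static final long MAX_BYTES = 5L * 1024 * 1024; // 5 MiB default; a real setting would be configurable

  /** Returns true if the trace payload looks small enough to load, per its advertised Content-Length. */
  static boolean safeToLoad(HttpClient client, String traceUrl) throws Exception {
    HttpRequest head = HttpRequest.newBuilder(URI.create(traceUrl))
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<Void> response = client.send(head, HttpResponse.BodyHandlers.discarding());
    long contentLength = response.headers()
        .firstValueAsLong("Content-Length")
        .orElse(-1);
    // If the server does not advertise a length, fall back to loading (or a streaming check).
    return contentLength < 0 || contentLength <= MAX_BYTES;
  }
}
```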

Another way that comes to mind is to look at what tools exist for incremental parsing of JSON. cc @openzipkin/ui
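For example (an assumption about approach, not a committed design), Jackson's streaming JsonParser can walk a trace's span array token by token and bail out once a span-count threshold is exceeded, without materializing the whole document; a JavaScript streaming parser could play the same role in the UI:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.InputStream;

final class IncrementalSpanCounter {
  /**
   * Counts spans in a trace (a JSON array of span objects) without loading the whole
   * document, stopping as soon as maxSpans is exceeded.
   */
  static boolean withinLimit(InputStream traceJson, int maxSpans) throws Exception {
    JsonFactory factory = new JsonFactory();
    int spans = 0;
    try (JsonParser parser = factory.createParser(traceJson)) {
      if (parser.nextToken() != JsonToken.START_ARRAY) return true; // not a trace array
      while (parser.nextToken() == JsonToken.START_OBJECT) {
        if (++spans > maxSpans) return false; // stop early: no need to read the rest
        parser.skipChildren(); // advance past the rest of this span object
      }
    }
    return true;
  }
}
```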

@codefromthecrypt
Member

Opened #2498 on the in-memory storage special-casing, though it won't work for other storage. The rationale is that it is meant for testing, so maybe we can improve the limiter.

@codefromthecrypt
Member

Suggested by @jorgheymans in #2554 on what to do when said limit is reached:

I think that a drill-down mode could be more helpful here; a natural way to investigate a trace is by clicking your way through the parts that take up the most time until you get to the leaf spans.

@codefromthecrypt
Member

FWIW, @bulicekj had the same idea at the last UI workshop, I think: some sort of progressive loading of a trace. There are some issues with that, as we do validation up front, etc., but anyway I think this is helpful: https://cwiki.apache.org/confluence/display/ZIPKIN/2019-04-17+UX+workshop+and+Lens+GA+planning+at+LINE+Fukuoka

@codefromthecrypt
Member

Actually, I missed that #2411 is the better issue for the progressive loading... argh.
