-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a TraceGraph to the TracePage #276
Conversation
love it 🎉 |
This is great!! |
@@ -0,0 +1,99 @@ | |||
/* | |||
Copyright (c) 2017 Uber Technologies, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(c) 2018 The Jaeger Authors
looks like there are linter errors |
Sorry for this. Fixed the lint/flow errors now. Some tools are now working with some adaptions. Fixed 2 cases with eslint-disable-next-line no-param-reassign. For ideas how to fix this properly I am open. Javascript is not my day2day work. WIP: I will add some tests. |
Codecov Report
@@ Coverage Diff @@
## master #276 +/- ##
==========================================
+ Coverage 77.52% 81.83% +4.31%
==========================================
Files 137 139 +2
Lines 2976 3066 +90
Branches 617 633 +16
==========================================
+ Hits 2307 2509 +202
+ Misses 528 446 -82
+ Partials 141 111 -30
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic! Thank you for putting this PR together! I think folks are definitely going to find this useful :) 👍
Triggering this view
This is a great view, and the discoverability of a third option in "View Options" is pretty low. I think it makes sense to turn the View Options into a button with a dropdown menu, where the button is Trace Graph and the dropdown options are links to the JSON views. What do you think?
Aggregating on service and operation or by graph node
The stats for a node are based on the service and operation rather than being specific to that node. As a consequence, many nodes may have the same stats. And, while that in itself is not an issue, it implies each node has those stats rather than they all have those stats.
It seems more intuitive for the stats to be aggregated on a basis which matches the visual representation. For instance, in trace-diffs, the values associated with a node are specific to that node.
The current svc + op aggregation is possibly a bit confusing and is inconsistent with trace diffs.
We explored alternate aggregations in #252, which I think looks very promising. That approach would match the aggregation to the visuals, and adjust both when cycling through different aggregations. (And, service + group is one of the options.)
What do you think about aggregating on a node-by-node basis, so that each node in the graph conveys values that are specific to, and derived from, that node?
If we switch to aggregating on the node basis, I think a change to the DagNode
class will simplify the aggregation, described in "Alternate aggregation", below.
Deriving the counts, durations, etc
I'm interpreting the percent of time in method as "self-time" in the span, which is to say time where there are no children. Please let me know if that's incorrect.
When I viewed the trace graph on a larger trace (~3400 spans) I noticed:
- The performance of the aggregation is a bit heavy
- Sometimes there nodes where the count is 0 (same for duration, etc)
- Sometimes the time in the method is negative
Re 1. and 2., I think the change described in "Alternate aggregation" would resolve this.
Re 3., I think the current approach is to subtract the sum of children duration from the parent duration. One issue with this is it doesn't account for concurrency. For instance, if a parent is 1 second in duration and it spawns 10 children, in parallel, which are each 500ms and finish at the same time, the time in method will show -4 seconds instead of 500ms (roughly speaking). If I've got this wrong, let me know.
Changing to use intersecting ranges should resolve this, maybe something like node-drange might work?
What do you think?
Alternate aggregation
The change described below is only relevant if the aggregation is on a node-by-node basis, instead of service and operation.
On master, DagNode
has a count: Number
member which is the number of DenseSpan
s grouped into the DagNode
.
For an experimental feature to do trace comparisons on span durations, I switched DagSpan
to keep track of the DenseSpan
s it encapsulates instead of just the count
:
jaeger-ui/packages/jaeger-ui/src/model/trace-dag/DagNode.js
Lines 32 to 36 in 1d28725
parentID: ?NodeID; | |
id: NodeID; | |
// count: number; | |
members: DenseSpan[]; | |
children: Set<NodeID>; |
And, the corresponding change: TraceDag.js#L96-L99.
With this change, the members
is an array of DenseSpan
, which has a .span
property, so the aggregation would be across .members[*].span
for each node in the graph. I think then the aggregations can be calculated from the root nodes to the leaves, in a single pass, with efficient lookups, when necessary.
I might be inadvertently assuming prior knowledge or just plain doing a poor job communicating. So, let me know if I'm not making sense and I'll take another crack at it.
Lastly
As I was looking through the trace-diff and model/trace-dag/* code, I noticed the comments must be in the future, because they're certainly not in the code. (Doh! smh)
Thank you for taking the time to put this together! 🙏
Well the big thank goes to you because I just added a minor calculation. Regrading your points:
|
4136e29
to
db2ac3b
Compare
0627911
to
a02e727
Compare
@copa2 – Great changes! Regarding self-time, in the previous version of the branch, the self-time for Previously - Large negative Current - Small positive To check things out, I hacked up an edit to the HotROD demo that makes the diff --git a/examples/hotrod/services/driver/server.go b/examples/hotrod/services/driver/server.go
index 274ceb7..5212d0c 100644
--- a/examples/hotrod/services/driver/server.go
+++ b/examples/hotrod/services/driver/server.go
@@ -15,6 +15,8 @@
package driver
import (
+ "sync"
+
"github.com/opentracing/opentracing-go"
"github.com/uber/jaeger-lib/metrics"
"github.com/uber/tchannel-go"
@@ -78,25 +80,35 @@ func (s *Server) FindNearest(ctx thrift.Context, location string) ([]*driver.Dri
driverIDs := s.redis.FindDriverIDs(ctx, location)
retMe := make([]*driver.DriverLocation, len(driverIDs))
+ wg := sync.WaitGroup{}
+ rvLock := sync.Mutex{}
for i, driverID := range driverIDs {
- var drv Driver
- var err error
- for i := 0; i < 3; i++ {
- drv, err = s.redis.GetDriver(ctx, driverID)
- if err == nil {
- break
+ wg.Add(1)
+ go func(driverID string, i int) {
+ var drv Driver
+ var err error
+ for i := 0; i < 100; i++ {
+ drv, err = s.redis.GetDriver(ctx, driverID)
+ if err == nil {
+ break
+ }
+ s.logger.For(ctx).Error("Retrying GetDriver after error", zap.Int("retry_no", i+1), zap.Error(err))
+ }
+ // if err != nil {
+ // s.logger.For(ctx).Error("Failed to get driver after 3 attempts", zap.Error(err))
+ // // wg.Done()
+ // return
+ // }
+ rvLock.Lock()
+ retMe[i] = &driver.DriverLocation{
+ DriverID: drv.DriverID,
+ Location: drv.Location,
}
- s.logger.For(ctx).Error("Retrying GetDriver after error", zap.Int("retry_no", i+1), zap.Error(err))
- }
- if err != nil {
- s.logger.For(ctx).Error("Failed to get driver after 3 attempts", zap.Error(err))
- return nil, err
- }
- retMe[i] = &driver.DriverLocation{
- DriverID: drv.DriverID,
- Location: drv.Location,
- }
+ rvLock.Unlock()
+ wg.Done()
+ }(driverID, i)
}
+ wg.Wait()
s.logger.For(ctx).Info("Search successful", zap.Int("num_drivers", len(retMe)))
return retMe, nil
} |
NB: the calls to route service are already parallel in HotROD. It would be nice to add a cmd line option to enable parallelism of the GetDriver calls as well, using a similar executor pool as for the route-svc. |
@copa2 – Looks great! I'm really stoked about this diff! And, glad the Looks great. I had kind of a long comment regarding the implementation of Node metrics / aggregations Currently, the metrics shown on the nodes are:
This very much gives a sense of scale for the node with regard to how it contributed to the trace, as a whole. And, it is possible to compare groups with one another, but mainly as "How much did this group contribute to the trace vs that group", so, still as the node relates to the full trace. Since everything is the sum, the metrics are very sensitive to the count. And, that makes it tough to compare groups of different sizes. What do you think of the following metrics?
The left-of-center metrics are cumulative and speak to the contribution to the trace as a whole. The right metrics are averages. Seems like this might make it possible to compare groups against each other as well as see how a node contributes to the trace, overall. Screenshot of a mock: What do you think? Landing the graph and following up with the metrics Regarding which metrics to show in the node and which to base the color coding visualization on, seems like these warrant some exploration? I'm curious to hear from others, although gathering feedback on in-development UI features has been a bit over-looked. Two questions: How would you feel about presenting this work at the Jaeger bi-weekly to see if there is any feedback or suggestions? OTOH, landing the Trace Graph view would be great. How would you feel about landing it with only the counts metric visible and the other metrics hidden behind a feature flag (query paramter)? That would allow some breathing room for gathering feedback and mulling over the metrics? Awesome work! Thanks so much :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I thought I submitted this feedback with the comment about the metrics!
packages/jaeger-ui/src/components/TracePage/TraceGraph/OpNode.css
Outdated
Show resolved
Hide resolved
packages/jaeger-ui/src/components/TracePage/TraceGraph/OpNode.js
Outdated
Show resolved
Hide resolved
packages/jaeger-ui/src/components/TracePage/TraceGraph/OpNode.js
Outdated
Show resolved
Hide resolved
@tiffon - Thanks for the review. Really appreciate the constructive feedback. Regarding your points: Node metrics / aggregations Why do I use sums? They are easy to understand :-) I haven't tested with enough real data, so I am still trying to find out which data helps most. Landing the graph and following up with the metrics Color coding is currently on mode. Personally I think the component is not quite ready for prime time. I am also unhappy with some points. More feedback and suggestions also from other people is appreciated. I hoped this would happen within this pull request. |
@copa2 Thanks for your response. Re feedback in the PR, I agree that's preferable, but the bi-weekly will reach folks that might not typically respond. I can't present at this weeks bi-weekly, either (Happy Birthday to your niece!). But, can you show it at the following bi-weekly? It's a great feature to showcase, and feedback will be useful regardless of whether the PR is merged to master by then, or not. If so, I'll add a spot to the agenda. I'm OOO for a few days but will respond more fully when I return. |
@copa2 this is a great work. We would like to include it in Jaeger soon. Are there any blockers for this to be finished? |
@pavolloffay / @tiffon I see two possibilities:
To the current WIP which is more a lot of experimenting from me and not in a commitable state:
Also experimenting with different Node displays: @tiffon: This friday is bi-weekly. I will shortly showcase this WIP if you want. |
It would be great to discuss at the status call. I am also in favor of merging something as experimental MVP and iterate on that. |
@copa2 The progress and experiments look great! I've added an item to the bi-weekly agenda. Looking forward to learning more! |
@copa2 This change set looks great! I am also in favor of merging experimental work, pending two requests:
|
I am +1 on tagging a feature as "Experimental" (not WIP), but I am -1 on hiding it behind feature flags. Let's show it and have users raise questions. If we hide it we won't get any feedback. |
@tiffon / @yurishkuro / @everett980 |
I find the toggle button a bit confusing. I would prefer that it displays the name of the current view, and clicking on it opens a menu to select another view (including JSON view).
Certainly let's keep future ideas recorded in a meta-ticket. |
Well had this in the first version(see gif on top). Tiffon requested to change this his review:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the nodeMode state up to TraceGraph
instead of OpNode
is great to see!
I have outlined a few, much-easier changes to make but then I believe it is ready to merge.
packages/jaeger-ui/src/components/TracePage/TraceGraph/TraceGraph.js
Outdated
Show resolved
Hide resolved
Added experimental ribbon which links to #293 |
@everett980: Thanks for your feedback. Applied your requested changes. I will merge the commits into one commit when all reviews are ok. Open points:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments. I'd say the main two are the #rrggbbaa
only being supported by 80% or browsers and the DagNode#count
comment.
Great work! 🎉
packages/jaeger-ui/src/components/TracePage/TraceGraph/OpNode.js
Outdated
Show resolved
Hide resolved
packages/jaeger-ui/src/components/TracePage/TraceGraph/TraceGraph.js
Outdated
Show resolved
Hide resolved
packages/jaeger-ui/src/components/TracePage/TraceGraph/TraceGraph.css
Outdated
Show resolved
Hide resolved
packages/jaeger-ui/src/components/TracePage/TraceGraph/TraceGraph.js
Outdated
Show resolved
Hide resolved
The gif you showed looks great! Sorry for the mixed signals! |
No problem. UIs are always a matter of taste :-) More discussion normaly bring better results. Open now:
Will merge and rebase to master when reviews are through. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for adding the experimental link 👍
And...
Thanks for the awesome PR! 🎉
Add alternative view in TracePage which allows to see count, avg. time, total time and self time for a given trace grouped by service and operation. Signed-off-by: Patrick Coray <patrick.coray@bluewin.ch>
Rebased and squashed commits. @everett980: Let me know if you want further changes. Otherwise we should release this and apply improvements with #293. 😆 Altered Beast - my first game I remember was Galaga |
Add alternative view in TracePage which allows to see count, avg. time, total time and self time for a given trace grouped by service and operation. Signed-off-by: Patrick Coray <patrick.coray@bluewin.ch> Signed-off-by: Everett Ross <reverett@uber.com>
Add alternative view in TracePage which allows to see count, avg. time, total time and self time for a given trace grouped by service and operation. Signed-off-by: Patrick Coray <patrick.coray@bluewin.ch> Signed-off-by: vvvprabhakar <vvvprabhakar@gmail.com>
Add alternative view in TracePage which allows to see count, avg. time, total time and self time for a given trace grouped by service and operation. Signed-off-by: Patrick Coray <patrick.coray@bluewin.ch> Signed-off-by: Everett Ross <reverett@uber.com> Signed-off-by: vvvprabhakar <vvvprabhakar@gmail.com>
Resolves #273
This PR adds a TraceGraph view to the TracePage. This is an alternate view which summarize data of a trace: Count, TotalTime, % of time, in method time grouped by service and operation
(#273 (comment)).