Add a TraceGraph to the TracePage #276

Merged: 1 commit, Dec 18, 2018

Contributor

@copa2 copa2 commented Nov 13, 2018

Resolves #273

This PR adds a TraceGraph view to the TracePage. It is an alternate view that summarizes the data of a trace: count, total time, % of time, and in-method time, grouped by service and operation
(#273 (comment)).

jaegeruitracegraph2

@yurishkuro
Member

love it 🎉

@whistlinwilly

This is great!!

@@ -0,0 +1,99 @@
/*
Copyright (c) 2017 Uber Technologies, Inc.
Member

(c) 2018 The Jaeger Authors

@yurishkuro
Member

looks like there are linter errors

@copa2
Contributor Author

copa2 commented Nov 14, 2018

Sorry about this. The lint/flow errors are fixed now. Some of the tools work with a few adaptations.
I develop on Windows, so the pre-commit hook with lint (prettier/lint/flow/check-licence) does not work.

I fixed 2 cases with eslint-disable-next-line no-param-reassign. I am open to ideas on how to fix this properly; JavaScript is not my day-to-day work.

WIP: I will add some tests.

@codecov

codecov bot commented Nov 14, 2018

Codecov Report

Merging #276 into master will increase coverage by 4.31%.
The diff coverage is 90.1%.

@@            Coverage Diff             @@
##           master     #276      +/-   ##
==========================================
+ Coverage   77.52%   81.83%   +4.31%     
==========================================
  Files         137      139       +2     
  Lines        2976     3066      +90     
  Branches      617      633      +16     
==========================================
+ Hits         2307     2509     +202     
+ Misses        528      446      -82     
+ Partials      141      111      -30
Impacted Files                                           Coverage Δ
...ger-ui/src/components/TracePage/TracePageHeader.js    70% <ø> (ø) ⬆️
packages/jaeger-ui/src/model/trace-dag/TraceDag.js       58.69% <100%> (+58.69%) ⬆️
...r-ui/src/components/TracePage/TraceGraph/OpNode.js    100% <100%> (ø)
packages/jaeger-ui/src/model/trace-dag/DagNode.js        100% <100%> (+100%) ⬆️
...ckages/jaeger-ui/src/components/TracePage/index.js    71.01% <33.33%> (-1.05%) ⬇️
.../src/components/TracePage/TraceGraph/TraceGraph.js    89.7% <89.7%> (ø)
...neViewer/TimelineHeaderRow/TimelineViewingLayer.js    88.88% <0%> (+1.85%) ⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f51dfd7...0c51fc7.

@yurishkuro
Member

@tiffon please review, would love to merge this soon.

@copa2 is it possible to add tests? Only 5.12% of the new code is tested.

Member

@tiffon tiffon left a comment

This is fantastic! Thank you for putting this PR together! I think folks are definitely going to find this useful :) 👍

Triggering this view

This is a great view, and the discoverability of a third option in "View Options" is pretty low. I think it makes sense to turn the View Options into a button with a dropdown menu, where the button is Trace Graph and the dropdown options are links to the JSON views. What do you think?

Aggregating on service and operation or by graph node

The stats for a node are based on the service and operation rather than being specific to that node. As a consequence, many nodes may have the same stats. While that in itself is not an issue, it implies that each node individually has those stats, when in fact the nodes share them collectively.

It seems more intuitive for the stats to be aggregated on a basis which matches the visual representation. For instance, in trace-diffs, the values associated with a node are specific to that node.

The current svc + op aggregation is possibly a bit confusing and is inconsistent with trace diffs.

We explored alternate aggregations in #252, which I think looks very promising. That approach would match the aggregation to the visuals, and adjust both when cycling through different aggregations. (And, service + group is one of the options.)

What do you think about aggregating on a node-by-node basis, so that each node in the graph conveys values that are specific to, and derived from, that node?

If we switch to aggregating on the node basis, I think a change to the DagNode class will simplify the aggregation, described in "Alternate aggregation", below.

Deriving the counts, durations, etc

I'm interpreting the percent of time in method as "self-time" in the span, which is to say time where there are no children. Please let me know if that's incorrect.

When I viewed the trace graph on a larger trace (~3400 spans) I noticed:

  1. The performance of the aggregation is a bit heavy
  2. Sometimes there are nodes where the count is 0 (same for duration, etc)
  3. Sometimes the time in the method is negative

Re 1. and 2., I think the change described in "Alternate aggregation" would resolve this.

Re 3., I think the current approach is to subtract the sum of children duration from the parent duration. One issue with this is it doesn't account for concurrency. For instance, if a parent is 1 second in duration and it spawns 10 children, in parallel, which are each 500ms and finish at the same time, the time in method will show -4 seconds instead of 500ms (roughly speaking). If I've got this wrong, let me know.

Changing to use intersecting ranges should resolve this, maybe something like node-drange might work?
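
To make the concurrency example concrete, here is a minimal sketch of self-time computed from the union of child ranges rather than from the sum of child durations. It is hand-rolled interval merging, not the drange API, and the names `mergeRanges`, `selfTime`, `naiveSelfTime`, and the `{ start, end }` span shape (in ms) are assumptions made for illustration:

```javascript
// Hypothetical helpers; spans are plain { start, end } objects in milliseconds.

// Merge overlapping ranges into a sorted, disjoint list.
function mergeRanges(ranges) {
  const sorted = ranges.slice().sort((a, b) => a.start - b.start);
  const merged = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last && r.start <= last.end) {
      last.end = Math.max(last.end, r.end);
    } else {
      merged.push({ start: r.start, end: r.end });
    }
  }
  return merged;
}

// Self-time = parent duration minus the time covered by at least one child.
function selfTime(parent, children) {
  // Clip children to the parent's bounds so the result cannot go negative.
  const clipped = children
    .map(c => ({
      start: Math.max(c.start, parent.start),
      end: Math.min(c.end, parent.end),
    }))
    .filter(c => c.end > c.start);
  const covered = mergeRanges(clipped).reduce((sum, r) => sum + (r.end - r.start), 0);
  return parent.end - parent.start - covered;
}

// The approach described above: subtract the sum of child durations.
function naiveSelfTime(parent, children) {
  const childSum = children.reduce((sum, c) => sum + (c.end - c.start), 0);
  return parent.end - parent.start - childSum;
}

// 1 second parent with 10 parallel children of 500ms that finish together.
const parent = { start: 0, end: 1000 };
const children = Array.from({ length: 10 }, () => ({ start: 500, end: 1000 }));
console.log(naiveSelfTime(parent, children)); // -4000
console.log(selfTime(parent, children)); // 500
```

Clipping children to the parent's bounds is an extra assumption here; it keeps the result non-negative even if a child outlives its parent.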

What do you think?

Alternate aggregation

The change described below is only relevant if the aggregation is on a node-by-node basis, instead of service and operation.

On master, DagNode has a count: Number member which is the number of DenseSpans grouped into the DagNode.

For an experimental feature to do trace comparisons on span durations, I switched DagNode to keep track of the DenseSpans it encapsulates instead of just the count:

parentID: ?NodeID;
id: NodeID;
// count: number;
members: DenseSpan[];
children: Set<NodeID>;

And, the corresponding change: TraceDag.js#L96-L99.

With this change, members is an array of DenseSpan, each of which has a .span property, so the aggregation would be across .members[*].span for each node in the graph. I think the aggregations can then be calculated from the root nodes to the leaves, in a single pass, with efficient lookups when necessary.
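
A rough sketch of what that single pass could look like. The node shape below only mirrors the fields listed above; `aggregateByNode`, the `nodesMap` lookup, and the `{ span: { duration } }` member shape are assumptions for illustration, not the actual jaeger-ui classes:

```javascript
// Hypothetical node shape modeled on the fields listed above; not the real DagNode class.
// Each member is assumed to look like { span: { duration } }.
function aggregateByNode(nodesMap, rootIDs) {
  const stats = new Map();
  const stack = [...rootIDs];
  while (stack.length > 0) {
    const node = nodesMap.get(stack.pop());
    let totalDuration = 0;
    for (const member of node.members) {
      totalDuration += member.span.duration;
    }
    stats.set(node.id, { count: node.members.length, totalDuration });
    // Each node has a single parentID, so every node is visited exactly once.
    for (const childID of node.children) {
      stack.push(childID);
    }
  }
  return stats;
}

const span = duration => ({ span: { duration } });
const nodesMap = new Map([
  ['a', { id: 'a', parentID: null, members: [span(1000)], children: new Set(['b']) }],
  ['b', { id: 'b', parentID: 'a', members: [span(200), span(300)], children: new Set() }],
]);
const nodeStats = aggregateByNode(nodesMap, ['a']);
console.log(nodeStats.get('b')); // { count: 2, totalDuration: 500 }
```

Because the stats come from each node's own members, two nodes with the same service and operation but different positions in the graph get separate values.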

I might be inadvertently assuming prior knowledge or just plain doing a poor job communicating. So, let me know if I'm not making sense and I'll take another crack at it.

Lastly

As I was looking through the trace-diff and model/trace-dag/* code, I noticed the comments must be in the future, because they're certainly not in the code. (Doh! smh)

Thank you for taking the time to put this together! 🙏

@copa2
Contributor Author

copa2 commented Nov 18, 2018

Well, the big thanks go to you, because I just added a minor calculation.
The hard logic was already done in the trace-diff code. 👌

Regarding your points:

  • Dropdown Button
    Done
  • Node aggregation / Alternate aggregation
    Great suggestion. Implemented the members and node aggregation. This makes the code a lot clearer and simpler.
  • Percent: Percent of total trace time and not self-time. I think both values have their validity. I decided on total time, as the self-time percentage would always be 100% for leaf nodes. Total time helps me find the big chunks.
    Here is an experiment with an alternative view that shows the percentage as a progress bar. You can see that a leaf is always 100%.

altopnode

  • Negative values in "self-time"
    For producer-consumer we have "FOLLOWS_FROM", which I already filter. So fork-join are "CHILD_OF" relationships? Do you have an example trace/span for fork-join? Otherwise I will model one for testing.
    I will try to adapt the code so that it will consider this.

@copa2 copa2 force-pushed the tracegraph branch 2 times, most recently from 4136e29 to db2ac3b Compare November 18, 2018 23:19
@copa2
Contributor Author

copa2 commented Nov 21, 2018

Now using DRange to calculate "self-time". Fork-join spans should work, but I have no real test data.

This is now the state of the last iteration:
jeageruitracegraph3

@copa2 copa2 force-pushed the tracegraph branch 4 times, most recently from 0627911 to a02e727 Compare November 23, 2018 16:51
@tiffon
Member

tiffon commented Nov 24, 2018

@copa2 – Great changes!

Regarding self-time, in the previous version of the branch, the self-time for Driver::findNearest was a large negative (e.g. -133ms), but with the current version it's a small positive (e.g. 0.37ms), which is to be expected.

Previously - Large negative

self-time-negative

Current - Small positive

self-time-ok

To check things out, I hacked up an edit to the HotROD demo that makes the redis:GetDriver calls happen in parallel. The diff:

diff --git a/examples/hotrod/services/driver/server.go b/examples/hotrod/services/driver/server.go
index 274ceb7..5212d0c 100644
--- a/examples/hotrod/services/driver/server.go
+++ b/examples/hotrod/services/driver/server.go
@@ -15,6 +15,8 @@
 package driver
 
 import (
+	"sync"
+
 	"github.com/opentracing/opentracing-go"
 	"github.com/uber/jaeger-lib/metrics"
 	"github.com/uber/tchannel-go"
@@ -78,25 +80,35 @@ func (s *Server) FindNearest(ctx thrift.Context, location string) ([]*driver.Dri
 	driverIDs := s.redis.FindDriverIDs(ctx, location)
 
 	retMe := make([]*driver.DriverLocation, len(driverIDs))
+	wg := sync.WaitGroup{}
+	rvLock := sync.Mutex{}
 	for i, driverID := range driverIDs {
-		var drv Driver
-		var err error
-		for i := 0; i < 3; i++ {
-			drv, err = s.redis.GetDriver(ctx, driverID)
-			if err == nil {
-				break
+		wg.Add(1)
+		go func(driverID string, i int) {
+			var drv Driver
+			var err error
+			for i := 0; i < 100; i++ {
+				drv, err = s.redis.GetDriver(ctx, driverID)
+				if err == nil {
+					break
+				}
+				s.logger.For(ctx).Error("Retrying GetDriver after error", zap.Int("retry_no", i+1), zap.Error(err))
+			}
+			// if err != nil {
+			// 	s.logger.For(ctx).Error("Failed to get driver after 3 attempts", zap.Error(err))
+			// 	// wg.Done()
+			// 	return
+			// }
+			rvLock.Lock()
+			retMe[i] = &driver.DriverLocation{
+				DriverID: drv.DriverID,
+				Location: drv.Location,
 			}
-			s.logger.For(ctx).Error("Retrying GetDriver after error", zap.Int("retry_no", i+1), zap.Error(err))
-		}
-		if err != nil {
-			s.logger.For(ctx).Error("Failed to get driver after 3 attempts", zap.Error(err))
-			return nil, err
-		}
-		retMe[i] = &driver.DriverLocation{
-			DriverID: drv.DriverID,
-			Location: drv.Location,
-		}
+			rvLock.Unlock()
+			wg.Done()
+		}(driverID, i)
 	}
+	wg.Wait()
 	s.logger.For(ctx).Info("Search successful", zap.Int("num_drivers", len(retMe)))
 	return retMe, nil
 }

@yurishkuro
Member

NB: the calls to route service are already parallel in HotROD. It would be nice to add a cmd line option to enable parallelism of the GetDriver calls as well, using a similar executor pool as for the route-svc.

@tiffon
Member

tiffon commented Nov 24, 2018

@copa2 – Looks great! I'm really stoked about this diff!

And, glad the .members was helpful 👍

Looks great. I had kind of a long comment regarding the implementation of mode in the Trace Graph. The approach outlined is kind of but it's the result of DirectedGraph being pretty immature. The suggested approach is more typical of React, though. Please let me know what you think (it's inline in OpNode.js).

Node metrics / aggregations

Currently, the metrics shown on the nodes are:

  1. TL (Top left) - Count
    • Number of spans contained in the node
  2. BL - Duration
    • Cumulative duration of the spans contained in the node
  3. TR - % of trace duration
  4. BR - Self-time
    • I believe this is cumulative (lmk if I'm wrong)

This very much gives a sense of scale for the node with regard to how it contributed to the trace, as a whole. And, it is possible to compare groups with one another, but mainly as "How much did this group contribute to the trace vs that group", so, still as the node relates to the full trace.

Since everything is the sum, the metrics are very sensitive to the count. And, that makes it tough to compare groups of different sizes.

What do you think of the following metrics?

  1. TL - Count
    • Number of spans contained in the node
    • Unchanged
  2. BL - % of trace duration
    • Sum of durations as a percent of the trace duration
    • Moved from TR
  3. TR - Average span duration
    • Sum of durations / count
  4. BR - Average % self-time
    • Average % self-time (distinct from sum of self-time / sum of durations)

The left-of-center metrics are cumulative and speak to the contribution to the trace as a whole. The right metrics are averages.
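
To illustrate the proposed metrics, and in particular why the average % self-time differs from sum of self-time / sum of durations, here is a small sketch. The function name `nodeMetrics` and the `{ duration, selfTime }` member shape are assumptions for illustration only:

```javascript
// Hypothetical helper: members are { duration, selfTime } in the same time unit
// as traceDuration.
function nodeMetrics(members, traceDuration) {
  const count = members.length;
  const sumDuration = members.reduce((s, m) => s + m.duration, 0);
  const sumSelfTime = members.reduce((s, m) => s + m.selfTime, 0);
  const sumPctSelf = members.reduce((s, m) => s + (m.selfTime * 100) / m.duration, 0);
  return {
    count, // TL
    pctOfTrace: (sumDuration * 100) / traceDuration, // BL
    avgDuration: sumDuration / count, // TR
    avgPctSelfTime: sumPctSelf / count, // BR
    // Shown only for contrast; this is NOT the proposed BR metric.
    pctSelfOfSum: (sumSelfTime * 100) / sumDuration,
  };
}

// One short span that is all self-time, one long span that is mostly children.
const sample = nodeMetrics(
  [
    { duration: 100, selfTime: 100 },
    { duration: 400, selfTime: 100 },
  ],
  2000
);
console.log(sample.avgPctSelfTime); // 62.5
console.log(sample.pctSelfOfSum); // 40
```

The average of the per-span percentages weights every span equally, while the ratio of sums is dominated by the longer span.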

Seems like this might make it possible to compare groups against each other as well as see how a node contributes to the trace, overall.

Screenshot of a mock:

What do you think?

Landing the graph and following up with the metrics

Regarding which metrics to show in the node and which to base the color coding visualization on, seems like these warrant some exploration? I'm curious to hear from others, although gathering feedback on in-development UI features has been a bit overlooked.

Two questions:

How would you feel about presenting this work at the Jaeger bi-weekly to see if there is any feedback or suggestions?

OTOH, landing the Trace Graph view would be great. How would you feel about landing it with only the counts metric visible and the other metrics hidden behind a feature flag (query parameter)? That would allow some breathing room for gathering feedback and mulling over the metrics?

Awesome work! Thanks so much :)

@tiffon
Member

tiffon commented Nov 25, 2018

@copa2 I just saw in #273 that metrics are a big part of your interests in adding this. No problem if it makes more sense to land the PR with metrics rather than breaking it into two phases.

Member

@tiffon tiffon left a comment

Oops, I thought I submitted this feedback with the comment about the metrics!

@copa2
Contributor Author

copa2 commented Nov 26, 2018

@tiffon - Thanks for the review. Really appreciate the constructive feedback.

Regarding your points:

Node metrics / aggregations

Why do I use sums? They are easy to understand :-)
There might be multiple child paths that influence the value.
E.g., an operation may call 5 child RPCs for one input, but 0 children in another call with a different input.
So total time and self-time can be quite different, and the averages are not meaningful.

I haven't tested with enough real data, so I am still trying to find out which data helps most.
If you see more value in averages I will change this.

Landing the graph and following up with the metrics

Color coding currently depends on the mode.
Service: Uses the same color scheme as the timeline uses for the same service.
Time: Red for more time used. 0-20% is mapped to alpha values. This also needs more experimentation.
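
As a rough sketch of the time mode, the mapping from percent of trace time to an alpha value could look like the following. Only the 0-20% range comes from the description above; the function name `timeModeColor`, the clamping, the pure red base color, and the linear scale are assumptions:

```javascript
// Hypothetical helper: maps a node's percent of trace time to a red overlay.
// Values at or above maxPercent are fully opaque, 0% is fully transparent.
function timeModeColor(percentOfTrace, maxPercent = 20) {
  const clamped = Math.min(Math.max(percentOfTrace, 0), maxPercent);
  const alpha = clamped / maxPercent;
  // rgba() is used here instead of 8-digit hex (#rrggbbaa) notation.
  return `rgba(255, 0, 0, ${alpha.toFixed(2)})`;
}

console.log(timeModeColor(0)); // rgba(255, 0, 0, 0.00)
console.log(timeModeColor(10)); // rgba(255, 0, 0, 0.50)
console.log(timeModeColor(35)); // rgba(255, 0, 0, 1.00)
```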

Personally I think the component is not quite ready for prime time. I am also unhappy with some points.
So let me try to go with another iteration.

More feedback and suggestions from other people would also be appreciated. I hoped this would happen within this pull request.
I can also present this at a bi-weekly, but not this Friday (it's my niece's birthday). But if you want to show it, feel free.

@tiffon
Member

tiffon commented Nov 30, 2018

@copa2 Thanks for your response.

Re feedback in the PR, I agree that's preferable, but the bi-weekly will reach folks that might not typically respond.

I can't present at this week's bi-weekly, either (Happy Birthday to your niece!). But, can you show it at the following bi-weekly? It's a great feature to showcase, and feedback will be useful regardless of whether the PR is merged to master by then, or not. If so, I'll add a spot to the agenda.

I'm OOO for a few days but will respond more fully when I return.

@pavolloffay
Member

@copa2 this is great work. We would like to include it in Jaeger soon. Are there any blockers for this to be finished?

@copa2
Contributor Author

copa2 commented Dec 10, 2018

@pavolloffay / @tiffon
Sorry I couldn't invest the time I planned. So the current state is in WIP.

I see two possibilities:

  • Just apply the minimum changes as reviewed by tiffon and release as an MVP to get more feedback (could be done next weekend; I have no time this week).
  • Delay this to another release, once we have gathered more feedback from a bi-weekly.

The current WIP, which is mostly experimentation on my part and not in a committable state:
jaegeruitracegraph4

  • added sidebar for help and details

Also experimenting with different Node displays:
opnode1
opnode2
opnode3

@tiffon: This Friday is the bi-weekly. I will briefly showcase this WIP if you want.

@yurishkuro
Member

It would be great to discuss at the status call. I am also in favor of merging something as experimental MVP and iterate on that.

@tiffon
Member

tiffon commented Dec 11, 2018

@copa2 The progress and experiments look great!

I've added an item to the bi-weekly agenda. Looking forward to learning more!

@everett980
Collaborator

@copa2 This change set looks great! I am also in favor of merging experimental work, pending two requests:

  1. Resolving this anti-pattern. Parents invoking instance methods on descendants makes the code considerably harder to reason about and debug.
  2. Adding some form of disclaimer that this view is WIP, or perhaps hiding it behind a query param that indicates it is experimental. Once these experiments are more fleshed out the guard / disclaimer can be removed.

@yurishkuro
Member

I am +1 on tagging a feature as "Experimental" (not WIP), but I am -1 on hiding it behind feature flags. Let's show it and have users raise questions. If we hide it we won't get any feedback.

@copa2
Contributor Author

copa2 commented Dec 17, 2018

@tiffon / @yurishkuro / @everett980
Applied the review changes. Please take a look and let me know if this is OK.
This does not contain the NodeDetail and the different alternative views. We can add them in another story later.
It currently does not contain an "Experimental" indicator. I am not sure where to put it in the view, or where I should direct people who want to give feedback, since the issue and pull request will be closed by then. Should I open a new issue with "Enhancements to TraceGraph"?

Current view:
tracegraph4

@yurishkuro
Member

I find the toggle button a bit confusing. I would prefer that it displays the name of the current view, and clicking on it opens a menu to select another view (including JSON view).

Should I open a new issue with "Enhancements to TraceGraph"?

Certainly let's keep future ideas recorded in a meta-ticket.

@copa2
Contributor Author

copa2 commented Dec 17, 2018

I find the toggle button a bit confusing. I would prefer that it displays the name of the current view, and clicking on it opens a menu to select another view (including JSON view).

Well, I had this in the first version (see the GIF at the top). Tiffon requested the change in his review:

Triggering this view
This is a great view, and the discoverability of a third option in "View Options" is pretty low. I think it makes sense to turn the View Options into a button with a dropdown menu, where the button is Trace Graph and the dropdown options are links to the JSON views. What do you think?

Collaborator

@everett980 everett980 left a comment

Moving the nodeMode state up to TraceGraph instead of OpNode is great to see!
I have outlined a few, much-easier changes to make but then I believe it is ready to merge.

@copa2
Contributor Author

copa2 commented Dec 17, 2018

Added experimental ribbon which links to #293
experimental

@copa2
Contributor Author

copa2 commented Dec 17, 2018

@everett980: Thanks for your feedback. Applied your requested changes.

I will merge the commits into one commit when all reviews are ok.

Open points:

  • Add a TraceGraph to the TracePage #276 (comment). In my opinion it is needed. If there is no other feedback on how to solve it differently, I will mark it as resolved.
  • Should the entry point for the TraceGraph be a dropdown or a button? Can you, @tiffon / @yurishkuro, discuss this together? I can live with both solutions :-)
    Here is something in between:
    jaegeruitracegraphbtn
    So if you are ok with this new solution. I will commit this.

Member

@tiffon tiffon left a comment

A few minor comments. I'd say the main two are #rrggbbaa only being supported by 80% of browsers and the DagNode#count comment.

Great work! 🎉

@tiffon
Member

tiffon commented Dec 17, 2018

So if you are ok with this new solution. I will commit this.

The gif you showed looks great! Sorry for the mixed signals!

@copa2
Contributor Author

copa2 commented Dec 17, 2018

Sorry for the mixed signals!

No problem. UIs are always a matter of taste :-) More discussion normally brings better results.

Open now:

  • toFixed vs round2
  • count in DagNode
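
For context on the first open point, here is a small sketch of how the two approaches differ. The definition of `round2` is not shown in this thread, so the version below is only an assumption (a helper that rounds to two decimals and returns a number):

```javascript
// Assumed definition; the actual round2 helper in the PR may differ.
function round2(value) {
  return Math.round(value * 100) / 100;
}

const viaToFixed = (12.5).toFixed(2); // string, keeps trailing zeros
const viaRound2 = round2(12.5); // number, no trailing zeros

console.log(typeof viaToFixed, viaToFixed); // string 12.50
console.log(typeof viaRound2, viaRound2); // number 12.5
```

Number.prototype.toFixed always returns a string with a fixed number of decimals, while a numeric helper leaves formatting to the caller.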

Will merge and rebase to master when reviews are through.

Member

@tiffon tiffon left a comment

Looks great! Thanks for adding the experimental link 👍

And...

Thanks for the awesome PR! 🎉

copa2 === BEAST_MODE_INCARNATE

Add alternative view in TracePage which allows
to see count, avg. time, total time and self time for
a given trace grouped by service and operation.

Signed-off-by: Patrick Coray <patrick.coray@bluewin.ch>
@copa2
Contributor Author

copa2 commented Dec 18, 2018

Rebased and squashed commits.

@everett980: Let me know if you want further changes. Otherwise we should release this and apply improvements with #293.

😆 Altered Beast - my first game I remember was Galaga

@tiffon tiffon merged commit 0a36984 into jaegertracing:master Dec 18, 2018