"Monitor" tab for service health metrics #2954

Closed
14 tasks done
albertteoh opened this issue Apr 24, 2021 · 18 comments

Proposed sub-tasks

Jaeger-Query

Owners: @albertteoh

Jaeger-UI

Owners: @th3M1ke

Documentation

Owners: @albertteoh

Requirement - what kind of business use case are you trying to solve?

The main proposal is documented in: #2736.

The motivation is to help identify interesting traces (high-QPS, slow, or erroneous) without knowing the service or operations up-front.

Use cases include:

  • Post-deployment sanity checks across the org, or on known dependent services in the request chain.
  • Monitoring and root-causing when alerted of an issue.
  • Better onboarding experience for new users of Jaeger UI.
  • Long-term trend analysis of QPS, errors and latencies.
  • Capacity planning.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Add a new "Monitor" tab, situated after "Compare", containing service-level request rates, error rates, latencies and impact (= latency * request rate, which avoids "false positives" from low-QPS endpoints with high latencies).

The data will be sourced from jaeger-query's new metrics endpoints.

As the jaeger-query metrics endpoints require opt-in to be enabled, the Monitor tab will have a sensible empty state, perhaps with a link to documentation on how to enable metrics querying capabilities.
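
For illustration, here is a minimal sketch of how the UI might query those metrics endpoints and derive the "impact" column. The endpoint paths, query parameters and response shape below are assumptions made for the sketch, not the finalized jaeger-query API:

```typescript
// Sketch only: endpoint paths, parameter names and response shape are assumptions.
interface MetricPoint {
  timestamp: number; // ms since epoch
  value: number;
}

// Hypothetical helper that queries a jaeger-query metrics endpoint.
async function fetchMetric(
  metric: 'calls' | 'errors' | 'latencies',
  service: string,
  lookbackMs: number,
  stepMs: number,
  quantile?: number,
): Promise<MetricPoint[]> {
  const params = new URLSearchParams({
    service,
    endTs: String(Date.now()),
    lookback: String(lookbackMs),
    step: String(stepMs),
  });
  if (quantile !== undefined) params.set('quantile', String(quantile));
  const res = await fetch(`/api/metrics/${metric}?${params}`);
  if (!res.ok) throw new Error(`metrics query failed: ${res.status}`);
  return (await res.json()) as MetricPoint[];
}

// Impact = latency * request rate, so low-QPS endpoints with high latencies
// don't dominate the default sort.
function impact(latenciesMs: MetricPoint[], callRates: MetricPoint[]): number {
  const avg = (xs: MetricPoint[]) =>
    xs.reduce((sum, p) => sum + p.value, 0) / Math.max(xs.length, 1);
  return avg(latenciesMs) * avg(callRates);
}
```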

Workflow

The screen will open to a per-service set of metrics, sorted by default on Impact. Columns are user-configurable, with other latency percentiles available among other options. A search box will be available to filter on service names.

The user need only supply the time period over which to fetch metrics (similar to Find Traces), defaulting to a 1-hour lookback.

Note the user is not required to define the step size (the period between data points), at least in this iteration, to keep the user experience as simple as possible. Instead, we propose to derive the step size from a sensible heuristic based on the query period and/or the width of the chart, for example (see the sketch after this list):

  • < 30m search period -> 15s step
  • < 1h search period -> 1m step, etc.
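
A minimal sketch of one such heuristic, using the two breakpoints listed above; the later breakpoints are placeholders rather than a decided design:

```typescript
// Map the user's search period to a step size (the period between data points).
// Only the first two breakpoints come from the proposal; the rest are placeholders.
function stepForLookback(lookbackMs: number): number {
  const SECOND = 1_000;
  const MINUTE = 60 * SECOND;
  const HOUR = 60 * MINUTE;
  if (lookbackMs < 30 * MINUTE) return 15 * SECOND; // < 30m -> 15s step
  if (lookbackMs < HOUR) return MINUTE;             // < 1h  -> 1m step
  if (lookbackMs < 6 * HOUR) return 5 * MINUTE;     // placeholder
  return 15 * MINUTE;                               // placeholder
}
```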

There are two possible actions from here in this tab:

  • Click on a service to drill down to per-operation metrics.
  • Click on "View all traces" to go to the Search tab with the service pre-populated and Operation filter set to "all".

Service metrics page

When drilling down into a service's metrics, the page will show a summary of the RED metrics at the top, along with the equivalent per-operation metrics presented as in the per-service view above. Similarly, there will be a search box to filter on operations, and the user has the option to "View all traces" for a given operation.
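
To make the table's data shape concrete, here is a sketch of what a per-operation row might carry; the field names are illustrative assumptions, not a settled schema:

```typescript
// Illustrative shape for one row of the per-operation RED metrics table.
interface OperationMetricsRow {
  operation: string;
  requestRate: number;  // calls per second over the selected period
  errorRate: number;    // fraction of calls that errored (0..1)
  p95LatencyMs: number; // other percentiles could be exposed as configurable columns
  impact: number;       // latency * request rate, the default sort key
}
```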

Search tab

The Search tab will be the final stage in the workflow (unless, of course, the user navigates back to a previous state); it is pre-populated with the service and/or operation as well as the search period.

The search period will be sticky between each of these screens to maintain consistency in search results.

Demo

Courtesy of @Danafrid.

[Screenshot: jaeger-monitor-tab-service]

[Screenshot: jaeger-monitor-operation-tab]

[Screen recording: Screen.Recording.2021-04-14.at.11.52.58.mov]

Any open questions to address

  • Any suggestions on charting libraries to use for the larger detailed charts and the smaller row-level graphs in the table views?
  • Any requirement to maintain consistency with the trace statistics table view?
  • What is the preferred behaviour when a large number of services/operations are returned?
    • Show the top n results ordered by Impact by default? What if the user sorts on a different metric like errors? Just sort on the current n results or refetch from jaeger-query?
    • Show everything?
    • Paginate? (probably want to avoid this as it would require maintaining state in UI or jaeger-query)
jpkrohling (Contributor) commented:

I love these mockups, and I can think of a couple of things to add after this is implemented. For instance, clicking on a part of the graph could lead people to the search results view for the relevant part they clicked: when clicking the most recent part of the errors graph, the search results for "error=true" would be shown for the time window represented by the area the user clicked. Same for latency, returning the traces for that time window, sorted by latency, highest first.
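
As an illustration of that idea, here is a rough sketch of building a Search-tab URL for the clicked time window; the parameter names (start/end in microseconds, tags as JSON) are assumptions about the existing Jaeger UI search URL scheme:

```typescript
// Build a Search-tab URL for the time window the user clicked on a graph.
// Assumption: start/end are microseconds and tags are passed as a JSON object.
function searchUrlForWindow(
  service: string,
  windowStartMs: number,
  windowEndMs: number,
  onlyErrors: boolean,
): string {
  const params = new URLSearchParams({
    service,
    start: String(windowStartMs * 1000),
    end: String(windowEndMs * 1000),
    limit: '20',
  });
  if (onlyErrors) params.set('tags', JSON.stringify({ error: 'true' }));
  return `/search?${params}`;
}
```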

In any case, I would recommend taking a look at Kiali, to see what they've done and what we could replicate. I particularly love the flame graphs, which should give an idea of how normal the latencies are for a given service/endpoint.

[Two images omitted; taken from the blog post linked below.]

The images are from this blog post: Trace my mesh. Make sure to check part 2 and part 3 too.

> Any suggestions on charting libraries to use for the larger detailed charts and the smaller row-level graphs in the table views?

I asked the Kiali folks to chime in, I'm sure they have some experience with this. In any case, I would ask to keep one feature in mind: the ability to embed those graphs by other solutions. I'd argue that a key aspect of the Jaeger UI is the fact that portions of it can be embedded into other applications, such as Kiali and Grafana Tempo.

> What is the preferred behaviour when a large number of services/operations are returned? Show the top n results ordered by Impact by default?

I would vote for the most conservative solution that pleases the biggest number of use cases. This is a new feature, we don't need to get it 100% right on our first try. I think sorting by the biggest impact would be the best solution.

> What if the user sorts on a different metric like errors? Just sort on the current n results or refetch from jaeger-query?

Re-fetch.

> Paginate? (probably want to avoid this as it would require maintaining state in UI or jaeger-query)

Can't we do an infinite scroll? The UI would then run the same query, for the same time window, just with a different offset.
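
If infinite scroll were adopted, the fetch loop might look roughly like the sketch below; the offset/limit parameters are hypothetical illustrations, not an existing jaeger-query API:

```typescript
// Hypothetical infinite-scroll fetch: re-run the same query for the same time
// window, bumping only the offset. The offset/limit parameters are assumptions
// made for illustration.
async function fetchNextPage<T>(
  baseQuery: Record<string, string>,
  offset: number,
  limit = 50,
): Promise<T[]> {
  const params = new URLSearchParams({
    ...baseQuery,
    offset: String(offset),
    limit: String(limit),
  });
  const res = await fetch(`/api/metrics/calls?${params}`);
  if (!res.ok) throw new Error(`metrics query failed: ${res.status}`);
  return (await res.json()) as T[];
}
```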

th3M1ke (Contributor) commented May 13, 2021

@yurishkuro @albertteoh Could I take the UI tasks on?

albertteoh (Contributor, Author) commented:

@th3M1ke yup, I've added you as an owner of the UI tasks. Thanks for your help!

RyanSiu1995 commented:

Hi @albertteoh, is there anything folks can help with to get this released faster?
I can pick up some tasks if you have any.

albertteoh (Contributor, Author) commented:

@th3M1ke / @yoave23 how are you guys going with the UI side of the Monitor tab? Are there any well-defined/self-contained tasks that @RyanSiu1995 could help with?

I'll also need to complete the documentation, but I didn't want to work on that until the UI component is ready, to avoid prematurely communicating the completion of this feature.

th3M1ke (Contributor) commented Sep 20, 2021

Hi folks! We are about to finish; we're covering the features with tests. We hope to open a PR by the end of this week.

RyanSiu1995 commented:

Thank you! I hope it goes live soon.

pranav-bhatt commented:

Hi! Any updates on the release of this feature?

albertteoh (Contributor, Author) commented:

@th3M1ke is addressing feedback from jaegertracing/jaeger-ui#815, which should be the last major piece of work for this feature.

tianruyun commented:

May I ask why this feature (#2954) has not gone live yet?

albertteoh (Contributor, Author) commented:

> May I ask why this feature (#2954) has not gone live yet?

@tianruyun We're still testing this feature, which has surfaced some bugs/improvements, either in the OpenTelemetry Collector (on whose data the Monitor tab depends) or in Jaeger:

Those are the last remaining tasks so far before we can make this feature available in Jaeger. I plan to find some time to work on documentation in parallel with the remaining two Jaeger tasks above. You're most welcome to provide contributions as well. 😄

schickling commented:

Really excited for this. Is there a way to give this a try already on the latest release?

albertteoh (Contributor, Author) commented:

@schickling there's a demo you can run locally: https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor.

albertteoh (Contributor, Author) commented:

All tasks complete, closing this issue. We can address any feedback/bugs as separate Issues/PRs.

FingerLiu commented:

Waiting to use this in production! When will this be released?

albertteoh (Contributor, Author) commented:

> Waiting to use this in production! When will this be released?

Next release planned for 6 April: https://github.com/jaegertracing/jaeger/blob/main/RELEASE.md#release-managers

FingerLiu commented:

Hi @albertteoh, it seems the Monitor tab is still not in the new release 1.33: https://github.com/jaegertracing/jaeger/releases/tag/v1.33.0

What obstacle did we meet?

albertteoh (Contributor, Author) commented Apr 12, 2022

@FingerLiu the main functionality for the Monitor tab was essentially completed in previous releases. This is why you won't see references to the Monitor tab in the jaeger v1.33.0 release notes, which mainly emphasise changes to the Jaeger backend components.

Most of the remaining changes needed for the Monitor tab work to be considered "done" were documentation and other frontend enhancements/bug fixes, and these were both released as part of the jaeger v1.33.0 release (NB: I've updated those release notes to include the pinned Jaeger UI version).
