Add metrics query API spec #2946
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #2946      +/-   ##
==========================================
- Coverage   95.97%   95.90%   -0.08%
==========================================
  Files         223      223
  Lines        9712     9712
==========================================
- Hits         9321     9314       -7
- Misses        323      328       +5
- Partials       68       70       +2
```
Continue to review full report at Codecov.
I like this promise and how the service ended up looking, but I can't judge the protobuf details.
model/proto/metrics/service.proto (Outdated)
```proto
rpc GetMinStepDuration(GetMinStepDurationRequest) returns (GetMinStepDurationResponse);

// GetPerServiceLatencies gets latency metrics grouped by service.
rpc GetPerServiceLatencies(GetPerServiceLatenciesRequest) returns (GetMetricsResponse);
```
I still have a conceptual problem with this API.
First, it does not scale well. It will work fine for small shops with a dozen services. For 100 services it becomes iffy: is it useful for a user to see 100 charts? For orgs with 1000s of services this does not work at all; it will probably choke on the data volume, and even if it doesn't, it's completely useless to return 1000s of time series to the UI.
Second, I don't think this is the right user workflow. The notion of a "service" is pretty loose: a serverless function can be a service, or an instance of an ML model being served, so even small shops can easily get into a state with 100s or 1000s of "services". There's no user persona that needs to see all this data at once, nor see it ranked in a flat list; because the latency requirements are very different for different services, it doesn't make sense to just sort and return top-K.
Why not start with having this API serve a "default dashboard" for a single service? That's a clear use case and a well understood user workflow. To extend this to multiple services requires some form of grouping so that we don't pull the data for all services at once.
> Why not start with having this API serve a "default dashboard" for a single service?
The UI would then first get a list of services, decide which ones to place on the screen, and run queries passing the service they are interested in?
No, I am thinking the user will pick a service; the UI won't decide by itself. It's essentially how the search works today as well, so the behavior will be consistent in the UI. I don't think it fits exactly the vision that Albert had, but per my comment above I don't think that vision scales well with the number of services.
Instead of requiring the user to select a service to get useful data, we could preemptively show data about some services, even if only the 10 services with the most recent activity.
Thanks for your feedback @jpkrohling & @yurishkuro.
> First, it does not scale well.
I agree, this will not scale for 100s+ services, especially for latency computation, which is at least an O(n^3) problem. A single service-level view would reduce this to an O(n^2) problem.
As such, I'll remove those higher-level cross-service endpoints from the API, and we'll rethink the UI design to accommodate this new workflow, where the user selects a service to display the metrics for its operations or, as @jpkrohling suggested, the UI shows the k most recently active services.
> There's no user persona that needs to see all this data all at once, nor see it ranked in a flat list, because the requirements for latency are very different for different services, it doesn't make sense to just sort and return top-K.
I think there are valid use cases for a higher-level view of services' metrics.
Some example use cases (feel free to refute if these examples seem spurious):
- Post-deployment sanity checks: As a developer or devops engineer, I want to be assured that my deployment does not negatively impact other services (especially those that are not immediate dependencies) and therefore the business in general.
- As an engineer, I want to quickly pinpoint the cause of a vague problem reported by customers ("the website is slow") and avoid the back-and-forth needed to find the slow service.
I agree that different services have different latency, and perhaps even error, profiles. Ideally, surfacing a change in error rate, latency, etc., or some form of anomaly detection, would provide the most value (especially for the use cases above), though that would be more complex, so this proposal is a more attainable stepping stone in that direction.
Yes, there's little value in returning all services, although I thought there was some value in sorting and returning top-K, especially if it's based on a delta of the metric. Again, this is compute-intensive and would not be a scalable solution in this iteration.
Given the above, we will continue thinking of better ways to compute and present this higher-level service view to users in a scalable manner (e.g. post-processing a spoon-fed "view" of metrics) as we still believe it is a valuable feature and also welcome suggestions.
> although I thought there was some value in sorting and returning top-K
My point was that comparing latencies across services is pretty meaningless. There can always be some kind of batch processing service triggered by an RPC from a cron job, whose latency will always be high, but it doesn't mean that it's abnormal. Whereas a critical shared service (like cache) whose latency went from 1ms to 2ms won't make the top-K but it's clearly a drastic regression.
I agree. The intention of the first set of PRs was to use this as a stepping stone towards something more meaningful, like showing the change in a metric, or something more sophisticated like anomaly detection, given those are more complex. Additionally, we included an "Impact" column in the proposed UI mockup that multiplies the latency by the call rate, which helps make the ranking a bit more meaningful; though of course, a delta of the metrics would amplify the positive signal better, I think.
Would you prefer that we tackle this now rather than later? That is, at least something simpler, like having the response include a delta between the last data point of the current time window (now-lookback, now] and the last data point of the previous time window (now-lookback*2, now-lookback], or something similar?
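For illustration only, here is one way such a delta could be represented in a response; the message and field names below are hypothetical, sketched from the two windows just described, and are not part of this PR:

```proto
syntax = "proto3";

// Hypothetical sketch of a delta-carrying data point, not part of the spec.
message MetricDataPointWithDelta {
  // Last data point of the current time window (now-lookback, now].
  double current_value = 1;
  // Last data point of the previous time window (now-lookback*2, now-lookback].
  double previous_value = 2;
  // current_value - previous_value; a large positive delta on latency or
  // error rate is a stronger signal than the raw value alone.
  double delta = 3;
}
```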
Please let me know if there are any further questions/concerns that need addressing in this PR.
Given the positive reaction regarding the mockup, I would go with the minimal solution that works, release it, and iterate.
Thanks @jpkrohling :)
I've updated the API to support metrics aggregated at the service level and the operation level, requiring the client to explicitly pass service names at all times, with the following goals in mind (a rough sketch of the resulting shape follows the list):
- Address the scalability concern by not returning all service metrics
- Support the proposed "user preference" list of metrics grouped by service.
- Support a per-service set of metrics grouped by operation, while giving clients the flexibility to fetch more than one service's metrics concurrently upfront if they wish, though we expect the more common use case to be a single service (the single-service view with multiple RED metrics per operation).
- Minimize "surface area" of the API, with little scope for submitting custom queries.
- Simplify the API by minimizing the number of RPC endpoints.
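To make this concrete, here is a rough sketch of the request shapes these goals imply. The exact names, field numbers, and the group-by-operation flag are assumptions for illustration based on this discussion, not the merged spec:

```proto
syntax = "proto3";

// Sketch only; names and field numbers are illustrative assumptions.

// Parameters common to all metrics queries. Clients must always name the
// services they are interested in, so the backend never has to return
// metrics for every known service.
message MetricsQueryBaseRequest {
  // The services to fetch metrics for; at least one is required.
  repeated string serviceNames = 1;
  // When true, results are additionally grouped by operation, enabling the
  // single-service view with per-operation RED metrics.
  bool groupByOperation = 2;
}

// GetLatenciesRequest contains parameters for the GetLatencies RPC call.
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;
  // The quantile to compute from latency histogram metrics, e.g. 0.95 for p95.
  double quantile = 2;
}

// Placeholder for the shared response carrying the metrics time series.
message GetMetricsResponse {}

service MetricsQueryService {
  // GetLatencies gets latency metrics for the requested services.
  rpc GetLatencies(GetLatenciesRequest) returns (GetMetricsResponse);
}
```

Sharing a single base request across RPCs keeps the endpoint count low: adding a new metric type means adding one request message, not a parallel family of per-grouping endpoints.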
@albertteoh, ping us when this is ready for re-review!
Thanks @jpkrohling, ready for re-review!
I'd like to move forward with this; please let me know if there are any outstanding issues and I'll be happy to address them.
LGTM. You may want to tag this experimental somewhere until the actual backend functionality is built. I would recommend a master ticket with an execution plan checklist, like:
- service IDL & data model
- service impl that queries Prom
- UI module that displays it
- anything else?
```proto
// GetLatenciesRequest contains parameters for the GetLatencies RPC call.
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;
  // quantile is the quantile to compute from latency histogram metrics.
```
Could use an example. What are the units? If we want p99, should this be quantile=99? If so, why the double type?
> Could use an example. What are the units?

+1 I'll add an example.
> If we want p99, should this be quantile=99? If so, why the double type?

The possible values range from 0 to 1, so p99 would be 0.99. Note that p99 is a percentile, which ranges from 0-100 and is a 100-quantile.
IIUC, quantiles are a broader definition that equally slices the population of data into any number of partitions. For example, p999 = the 99.9th percentile and would have a quantile value of 0.999.
Moreover, most (if not all) metrics backends use quantiles instead of percentiles as parameters to their "histogram quantile" calculations, for instance:
- InfluxDB
- Prometheus, VictoriaMetrics and M3
Maintaining the 0-1 range of values for quantile means there is no need to perform any computation to map percentiles to quantiles before passing the parameter to the metrics backend. It also supports quantiles finer-grained than percentiles (e.g. 0.999), which would be a rare use case, but I feel there's no harm in allowing it, as long as it's well documented.
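Concretely, the documented field might read as follows; the surrounding message is taken from the diff above, while the field number for quantile and the comment wording are assumptions for illustration:

```proto
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;
  // quantile is the quantile to compute from latency histogram metrics,
  // expressed in the range (0, 1]. Examples:
  //   0.95  -> p95 (95th percentile)
  //   0.99  -> p99
  //   0.999 -> p99.9 (finer-grained than any percentile, hence the double type)
  double quantile = 2;
}
```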
model/proto/metrics/service.proto (Outdated)
```proto
}

service MetricsQueryService {
  // GetMinStepDuration gets the min step duration supported by the backing metrics store.
```
Suggested change:
```diff
- // GetMinStepDuration gets the min step duration supported by the backing metrics store.
+ // GetMinStepDuration gets the min time resolution supported by the backing metrics store,
+ // e.g. 10s means the backend can only return data points that are at least 10s apart, not closer.
```
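For context, a minimal sketch of the messages this RPC implies; the response shape and its use of google.protobuf.Duration are assumptions for illustration, not the merged spec:

```proto
syntax = "proto3";

import "google/protobuf/duration.proto";

message GetMinStepDurationRequest {}

message GetMinStepDurationResponse {
  // The smallest step (time resolution) the backing metrics store supports;
  // e.g. 10s means returned data points can be no closer than 10 seconds apart.
  google.protobuf.Duration minStep = 1;
}
```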
Thanks Yuri!
Agreed.
Yup, I added a checklist in this master ticket: #2946 and referenced this PR. If you're happy with the recent changes, I'd appreciate another stamp, since the mergify bot has dismissed your last approval.
Thanks for this one, @albertteoh!
```proto
// See the License for the specific language governing permissions and
// limitations under the License.

// Based on: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto
```
It would be good to have a reference to the version used. Perhaps you can use a recent tag instead?
Good point! I've addressed this in #2975, could you please review it when you've got time?
```
gogoproto.marshaler_all, gogoproto.unmarshaler_all, etc. enabled.

Moreover, if direct imports of other repositories were possible, it would mean importing and generating code for
transitive dependencies not required by Jaeger leading to longer build times, and potentially larger docker
```
s/docker/container