Add metrics query API spec #2946
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #2946      +/-   ##
==========================================
- Coverage   95.97%   95.90%   -0.08%
==========================================
  Files         223      223
  Lines        9712     9712
==========================================
- Hits         9321     9314       -7
- Misses        323      328       +5
- Partials       68       70       +2
```
Continue to review full report at Codecov.
I like this promise and how the service ended up looking, but I can't judge the protobuf details.
model/proto/metrics/service.proto (Outdated)
```proto
rpc GetMinStepDuration(GetMinStepDurationRequest) returns (GetMinStepDurationResponse);

// GetPerServiceLatencies gets latency metrics grouped by service.
rpc GetPerServiceLatencies(GetPerServiceLatenciesRequest) returns (GetMetricsResponse);
```
I still have a conceptual problem with this API.
First, it does not scale well. It will work fine for small shops with a dozen services. For 100 services it becomes iffy: is it useful for a user to see 100 charts? For orgs with 1000s of services this does not work at all; it will probably choke on the data volume, and even if it doesn't, it's completely useless to return 1000s of time series to the UI.
Second, I don't think this is the right user workflow. The notion of a "service" is pretty loose: a serverless function can be a service, or an instance of an ML model being served, so even small shops can easily get into a state with 100s or 1000s of "services". There's no user persona that needs to see all this data at once, nor see it ranked in a flat list; because the latency requirements are very different for different services, it doesn't make sense to just sort and return top-K.
Why not start with having this API serve a "default dashboard" for a single service? That's a clear use case and a well understood user workflow. To extend this to multiple services requires some form of grouping so that we don't pull the data for all services at once.
> Why not start with having this API serve a "default dashboard" for a single service?
The UI would then first get a list of services, decide which ones to place on the screen, and run queries passing the service they are interested in?
No, I am thinking the user will pick a service; the UI won't decide by itself. It's essentially how the search works today as well, so the behavior will be consistent in the UI. I don't think it fits exactly the vision that Albert had, but per my comment above I don't think that vision scales well with the number of services.
Instead of requiring the user to select a service to get useful data, we could preemptively show data about some services, even if only the 10 services with the most recent activity.
Thanks for your feedback @jpkrohling & @yurishkuro.
> First, it does not scale well.
I agree, this will not scale for 100s+ services, especially for latency computation, which is at least an O(n^3) problem. A single service-level view would reduce this to an O(n^2) problem.
As such, I'll remove those higher-level cross-service endpoints from the API, and we'll rethink the UI design to accommodate this new workflow, where the user selects a service to display the metrics for its operations or, as @jpkrohling suggested, the UI shows the k most recently active services.
> There's no user persona that needs to see all this data all at once, nor see it ranked in a flat list, because the requirements for latency are very different for different services, it doesn't make sense to just sort and return top-K.
I think there are valid use cases for a higher-level view of services' metrics.
Some example use cases (feel free to refute if these examples seem spurious):
- Post-deployment sanity checks: As a developer or devops engineer, I want to be assured that my deployment does not negatively impact other services (especially those that are not immediate dependencies) and therefore the business in general.
- As an engineer, I want to quickly pinpoint the cause of a vague problem reported by customers ("the website is slow") and avoid the back-and-forth needed to find the slow service.
I agree that different services have different latency, and perhaps even error, profiles. Ideally, surfacing a change in error rate, latency, etc., or some form of anomaly detection, would provide the most value (especially for the use cases above), though that would be more complex, so this proposal is a more attainable stepping stone in that direction.
Yes, there's little value in returning all services, although I thought there was some value in sorting and returning top-K, especially if it's based on a delta of the metric. Again, this is compute-intensive and would not be a scalable solution in this iteration.
Given the above, we will continue thinking of better ways to compute and present this higher-level service view to users in a scalable manner (e.g. post-processing a spoon-fed "view" of metrics) as we still believe it is a valuable feature and also welcome suggestions.
> although I thought there was some value in sorting and returning top-K
My point was that comparing latencies across services is pretty meaningless. There can always be some kind of batch processing service triggered by an RPC from a cron job, whose latency will always be high, but it doesn't mean that it's abnormal. Whereas a critical shared service (like cache) whose latency went from 1ms to 2ms won't make the top-K but it's clearly a drastic regression.
I agree. The intention of the first set of PRs was to use this as a stepping stone towards something more meaningful, like showing the change in a metric, or something more sophisticated like anomaly detection, given those are more complex. Additionally, we included an "Impact" column in the proposed UI mockup that multiplies the latency by the call rate, which helps make the ranking a bit more meaningful; though of course, a delta of the metrics would amplify the positive signal better, I think.
Would you prefer that we tackle this now rather than later? That is, at least something simpler, like having the response include a delta between the last data point of the current time window (now-lookback, now] and the last data point of the previous time window (now-lookback*2, now-lookback], or something similar?
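For illustration only, here is one way such a delta could be represented in a response; the message and field names below are hypothetical, sketched from the two windows just described, and are not part of this PR:

```proto
syntax = "proto3";

// Hypothetical sketch of a delta-carrying data point, not part of the spec.
message MetricDataPointWithDelta {
  // Last data point of the current time window (now-lookback, now].
  double current_value = 1;
  // Last data point of the previous time window (now-lookback*2, now-lookback].
  double previous_value = 2;
  // current_value - previous_value; a large positive delta on latency or
  // error rate is a stronger signal than the raw value alone.
  double delta = 3;
}
```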
Please let me know if there are any further questions/concerns that need addressing in this PR.
Given the positive reaction regarding the mockup, I would go with the minimal solution that works, release it, and iterate.
Thanks @jpkrohling :)
I've updated the API to support metrics aggregated at the service level and the operation level, requiring the client to explicitly pass service names at all times, with the following goals in mind (a rough sketch of the resulting shape follows the list):
- Address the scalability concern by not returning all service metrics
- Support the proposed "user preference" list of metrics grouped by service.
- Support a per-service set of metrics grouped by operation, while giving clients the flexibility to fetch more than one service's metrics concurrently upfront if they wish, though we expect the more common use case to be a single service (the single-service view with multiple RED metrics per operation).
- Minimize "surface area" of the API, with little scope for submitting custom queries.
- Simplify the API by minimizing the number of RPC endpoints.
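To make this concrete, here is a rough sketch of the request shapes these goals imply. The exact names, field numbers, and the group-by-operation flag are assumptions for illustration based on this discussion, not the merged spec:

```proto
syntax = "proto3";

// Sketch only; names and field numbers are illustrative assumptions.

// Parameters common to all metrics queries. Clients must always name the
// services they are interested in, so the backend never has to return
// metrics for every known service.
message MetricsQueryBaseRequest {
  // The services to fetch metrics for; at least one is required.
  repeated string serviceNames = 1;
  // When true, results are additionally grouped by operation, enabling the
  // single-service view with per-operation RED metrics.
  bool groupByOperation = 2;
}

// GetLatenciesRequest contains parameters for the GetLatencies RPC call.
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;
  // The quantile to compute from latency histogram metrics, e.g. 0.95 for p95.
  double quantile = 2;
}

// Placeholder for the shared response carrying the metrics time series.
message GetMetricsResponse {}

service MetricsQueryService {
  // GetLatencies gets latency metrics for the requested services.
  rpc GetLatencies(GetLatenciesRequest) returns (GetMetricsResponse);
}
```

Sharing a single base request across RPCs keeps the endpoint count low: adding a new metric type means adding one request message, not a parallel family of per-grouping endpoints.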
@albertteoh, ping us when this is ready for re-review!
Thanks @jpkrohling, ready for re-review!
I'd like to move forward with this; please let me know if there are any outstanding issues and I'll be happy to address them.
LGTM. You may want to tag this experimental somewhere until the actual backend functionality is built. I would recommend a master ticket with an execution plan checklist, like:
- service IDL & data model
- service impl that queries Prom
- UI module that displays it
- anything else?
```proto
// GetLatenciesRequest contains parameters for the GetLatencies RPC call.
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;
  // quantile is the quantile to compute from latency histogram metrics.
```
Could use an example. What are the units? If we want p99, should this be quantile=99? If so, why the double type?
> Could use an example. What are the units?

+1 I'll add an example.
> If we want p99, should this be quantile=99? If so, why the double type?

The possible values range from 0 to 1, so p99 would be 0.99. Note that p99 is a percentile, which ranges from 0-100 and is a 100-quantile.
IIUC, quantiles are a broader definition that equally slices the population of data into any number of partitions. For example, p999 = the 99.9th percentile and would have a quantile value of 0.999.
Moreover, most (if not all) metrics backends use quantiles instead of percentiles as parameters to their "histogram quantile" calculations, for instance:
- InfluxDB
- Prometheus, VictoriaMetrics and M3
Maintaining the 0-1 range of values for quantile means there is no need to perform any computation to map percentiles to quantiles before passing the parameter to the metrics backend. It also supports quantiles finer-grained than percentiles (e.g. 0.999), which would be a rare use case, but I feel there's no harm in allowing it, as long as it's well documented.
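Concretely, the documented field might read as follows; the surrounding message is taken from the diff above, while the field number for quantile and the comment wording are assumptions for illustration:

```proto
message GetLatenciesRequest {
  MetricsQueryBaseRequest baseRequest = 1;
  // quantile is the quantile to compute from latency histogram metrics,
  // expressed in the range (0, 1]. Examples:
  //   0.95  -> p95 (95th percentile)
  //   0.99  -> p99
  //   0.999 -> p99.9 (finer-grained than any percentile, hence the double type)
  double quantile = 2;
}
```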
model/proto/metrics/service.proto (Outdated)
```proto
}

service MetricsQueryService {
  // GetMinStepDuration gets the min step duration supported by the backing metrics store.
```
Suggested change:
```diff
- // GetMinStepDuration gets the min step duration supported by the backing metrics store.
+ // GetMinStepDuration gets the min time resolution supported by the backing metrics store,
+ // e.g. 10s means the backend can only return data points that are at least 10s apart, not closer.
```
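For context, a minimal sketch of the messages this RPC implies; the response shape and its use of google.protobuf.Duration are assumptions for illustration, not the merged spec:

```proto
syntax = "proto3";

import "google/protobuf/duration.proto";

message GetMinStepDurationRequest {}

message GetMinStepDurationResponse {
  // The smallest step (time resolution) the backing metrics store supports;
  // e.g. 10s means returned data points can be no closer than 10 seconds apart.
  google.protobuf.Duration minStep = 1;
}
```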
Thanks Yuri!
Agreed.
Yup, I added a checklist in this master ticket: #2946 and referenced this PR. If you're happy with the recent changes, I'd appreciate another stamp, since the mergify bot has dismissed your last approval.
Thanks for this one, @albertteoh!
```proto
// See the License for the specific language governing permissions and
// limitations under the License.

// Based on: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto
```
It would be good to have a reference to the version used. Perhaps you can use a recent tag instead?
Good point! I've addressed this in #2975, could you please review it when you've got time?
```
gogoproto.marshaler_all, gogoproto.unmarshaler_all, etc. enabled.

Moreover, if direct imports of other repositories were possible, it would mean importing and generating code for
transitive dependencies not required by Jaeger leading to longer build times, and potentially larger docker
```
s/docker/container