Prepare insight metrics structure for adding service_name label #4227
…s cache with respect to service_name
all_response_time_seconds = [int(response_time.total_seconds()) for response_time in all_response_time]

metric_alert_group_response_time[integration.id] = {
    # count alert groups with `service_name` label group by label value
Have we tried these queries on production db? Do we need to apply any optimizations?
Yes, the first call for response time is slow (>20s); subsequent calls are faster (~2s). I'm checking how to optimize this.
UPD: to speed up the queries with the `service_name` label, I added a filter by organization, which makes them use the label table's composite index.
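To illustrate why the extra organization filter matters, here is a minimal sketch using SQLite as a stand-in. The table, column, and index names are hypothetical, and the production database and its planner will differ; the only point is that a composite index whose leading column is the organization is usable once the query filters on that column.

```python
import sqlite3

# Hypothetical label table with a composite index that leads with organization_id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE alert_group_label (
        id INTEGER PRIMARY KEY,
        organization_id INTEGER NOT NULL,
        key_name TEXT NOT NULL,
        value_name TEXT NOT NULL,
        alert_group_id INTEGER NOT NULL
    );
    CREATE INDEX idx_org_key ON alert_group_label (organization_id, key_name);
    """
)


def query_plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Filtering by label key only: the leading index column is not constrained.
without_org = query_plan(
    "SELECT value_name, COUNT(*) FROM alert_group_label "
    "WHERE key_name = ? GROUP BY value_name",
    ("service_name",),
)

# Adding the organization filter lets the planner use the composite index.
with_org = query_plan(
    "SELECT value_name, COUNT(*) FROM alert_group_label "
    "WHERE organization_id = ? AND key_name = ? GROUP BY value_name",
    (1, "service_name"),
)

print(without_org)
print(with_org)
```

On the real database, the equivalent check would be the engine's own `EXPLAIN` output for the generated query.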
+1 to optimize as much as possible (did using the index improve things?). In any case, this shouldn't affect the response time of the scrape endpoint, right? (There we are still getting the data from the cache and returning it, correct?)
> +1 to optimize as much as possible (did using the index improve things?)

The slowest call in my tests is ~10s now. It uses the read-only DB, so I don't think this will be a problem 🤔

> In any case, this shouldn't affect the response time of the scrape endpoint, right? (There we are still getting the data from the cache and returning it, correct?)

Correct, we get all data from the cache and don't do any calculations there.
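A minimal sketch of the split described above, assuming a background recalculation step that fills the cache and a scrape handler that only reads it. The dict-based cache, the key format, the function names, and the metric value are all hypothetical placeholders.

```python
import typing

# Hypothetical in-memory stand-in for the real cache backend.
_cache: typing.Dict[str, dict] = {}
calculation_calls = 0


def recalculate_metrics(organization_id: int) -> None:
    """Background step: run the slow queries and store the result in the cache."""
    global calculation_calls
    calculation_calls += 1
    # Placeholder value standing in for the real query results.
    _cache[f"metrics_{organization_id}"] = {"alert_groups_total": 42}


def scrape(organization_id: int) -> dict:
    """Scrape step: return whatever is cached, without any calculations."""
    return _cache.get(f"metrics_{organization_id}", {})


recalculate_metrics(1)
first = scrape(1)
second = scrape(1)
```

With this split, slow recalculation queries only delay how fresh the cached values are; they do not add latency to a scrape.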
@@ -61,3 +66,6 @@ class RecalculateOrgMetricsDict(typing.TypedDict):

METRICS_ORGANIZATIONS_IDS = "metrics_organizations_ids"
METRICS_ORGANIZATIONS_IDS_CACHE_TIMEOUT = 3600  # 1 hour

SERVICE_LABEL = "service_name"
NO_SERVICE_VALUE = "No service"
Just curious, have we considered "Unnamed service" or "No name service" instead? ("No service" sounds like something is broken? :-))
It's like "No team", but for services :) And SLO uses "No service" as well, so it looks consistent
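For illustration, a sketch of how the two constants could be used when grouping alert groups by service: anything without a `service_name` label falls into the "No service" bucket. Only the two constants come from the diff; the helper name and the sample label dicts are hypothetical.

```python
from collections import Counter

SERVICE_LABEL = "service_name"
NO_SERVICE_VALUE = "No service"


def count_by_service(alert_group_labels):
    """Count alert groups per `service_name` value, with a fallback bucket."""
    return Counter(
        labels.get(SERVICE_LABEL, NO_SERVICE_VALUE) for labels in alert_group_labels
    )


# Hypothetical sample: each dict holds the labels of one alert group.
sample = [
    {"service_name": "checkout"},
    {"service_name": "checkout", "env": "prod"},
    {"env": "prod"},
    {},
]
counts = count_by_service(sample)
```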
Makes sense to me. I would double-check the metrics cache schema transition before releasing (and maybe add some extra tests for the update-metrics-cache helpers involving multiple services?)
# What this PR does

Adds `service_name` label to insight metrics

NOTE: It is related to [this PR](#4227) and should be merged no sooner than two days after the next release (current release version is 1.4.4), because we need to wait for the metrics cache to be updated for all organizations (uses the new cache structure with `services`)

## Which issue(s) this PR closes

Related to grafana/oncall-private#2610

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not required)
- [x] Added the relevant release notes label (see labels prefixed w/ `release:`). These labels dictate how your PR will show up in the autogenerated release notes.
What this PR does
Prepare insight metrics for adding the `service_name` label.
This PR updates the metrics cache structure, supporting both the old and new versions of the cache. The `service_name` label can be added in an additional PR once the whole metrics cache is updated.
Which issue(s) this PR closes
https://github.com/grafana/oncall-private/issues/2610
Checklist
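- Unit, integration, and e2e (if applicable) tests updated
- Documentation added (or `pr:no public docs` PR label added if not required)
- Added the relevant release notes label (see labels prefixed w/ `release:`). These labels dictate how your PR will show up in the autogenerated release notes.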
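As a rough sketch of what "supporting both old and new version of cache" could look like, the helper below normalizes a cached entry to the per-service shape. The flat old-style layout, the state names, and the helper name are assumptions made for illustration; only the `services` key and the "No service" value appear in this PR and its follow-up.

```python
NO_SERVICE_VALUE = "No service"
# Assumed set of alert group states stored per integration (hypothetical).
STATES = ("firing", "acknowledged", "silenced", "resolved")


def get_services(entry: dict) -> dict:
    """Return per-service counters for a cached entry in either format."""
    if "services" in entry:
        # New structure: counters are already grouped by service name.
        return entry["services"]
    # Old structure (assumed flat): treat all counters as "No service".
    return {NO_SERVICE_VALUE: {state: entry.get(state, 0) for state in STATES}}


old_entry = {"firing": 3, "acknowledged": 1, "silenced": 0, "resolved": 7}
new_entry = {"services": {"checkout": {"firing": 2, "acknowledged": 0, "silenced": 0, "resolved": 5}}}

old_services = get_services(old_entry)
new_services = get_services(new_entry)
```

Reading through a helper like this would let the metrics collector handle organizations whose cache has not been recalculated yet, which is why the follow-up PR waits for all caches to be refreshed.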