
Prepare insight metrics structure for adding service_name label #4227

Merged
Ferril merged 18 commits into dev from add-service-name-label-to-metrics on Apr 29, 2024

Conversation

@Ferril (Member) commented Apr 15, 2024

What this PR does

Prepares insight metrics for adding a `service_name` label.
This PR updates the metrics cache structure so that both the old and the new cache formats are supported.
The `service_name` label itself can be added in a follow-up PR once the metrics cache has been updated for all organizations.
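
For context, a minimal sketch of what "supporting both cache versions" can look like; the keys and counters below are illustrative assumptions, not the exact OnCall cache schema:

```python
# Illustrative only: keys and counters are assumptions, not OnCall's actual cache entries.

# Old-style cache entry for an integration's alert group metrics (no per-service breakdown):
old_entry = {
    "integration_name": "Grafana Alerting",
    "team_name": "No team",
    "firing": 3,
    "resolved": 10,
}

# New-style entry nests the counters under a `services` key, so that a `service_name`
# label can later be emitted per service value:
new_entry = {
    "integration_name": "Grafana Alerting",
    "team_name": "No team",
    "services": {
        "No service": {"firing": 2, "resolved": 7},
        "checkout-api": {"firing": 1, "resolved": 3},
    },
}


def total_firing(entry):
    """Read either cache version while both formats coexist."""
    if "services" in entry:
        return sum(service["firing"] for service in entry["services"].values())
    return entry["firing"]


print(total_firing(old_entry), total_firing(new_entry))  # 3 3
```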

Which issue(s) this PR closes

https://github.com/grafana/oncall-private/issues/2610

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)
  • Added the relevant release notes label (see labels prefixed w/ release:). These labels dictate how your PR will
    show up in the autogenerated release notes.

@Ferril added the `pr:no public docs` label (no public documentation updates required) Apr 19, 2024
@Ferril added the `release:ignore` label (PR will not be added to release notes) Apr 22, 2024
@Ferril changed the title from "Draft for adding service_name label to insight metrics" to "Prepare insight metrics structure for adding service_name label" Apr 22, 2024
@Ferril marked this pull request as ready for review Apr 22, 2024 08:23
@Ferril requested a review from a team as a code owner Apr 22, 2024 08:23
engine/apps/metrics_exporter/helpers.py (two outdated review threads, resolved)

Review thread on engine/apps/metrics_exporter/helpers.py:

all_response_time_seconds = [int(response_time.total_seconds()) for response_time in all_response_time]

metric_alert_group_response_time[integration.id] = {
# count alert groups with `service_name` label, grouped by label value
Contributor commented:

Have we tried these queries on production db? Do we need to apply any optimizations?

@Ferril (Member, Author) commented Apr 23, 2024:

Yes, the first call for response time is slow (>20s); subsequent calls are faster (~2s). I'm checking how to optimize this.
UPD: to speed up the queries that use the `service_name` label I added a filter by organization, which makes them use the label table's composite index.

Contributor commented:

+1 to optimize as possible (using the index improved things?). In any case, this shouldn't affect the response time in the scrape endpoint right? (there we are still getting the data from cache and returning it, correct?)

@Ferril (Member, Author) commented:

> +1 to optimize as possible (using the index improved things?)

The slowest call in my tests is ~10s now. It uses the read-only DB, so I don't think this will be a problem 🤔

> In any case, this shouldn't affect the response time in the scrape endpoint right? (there we are still getting the data from cache and returning it, correct?)

Correct, we get all data from the cache and don't do any calculations there
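
Following up on the composite-index point above, here is a hypothetical Django ORM sketch of the shape such a query could take; `alert_group_model`, the `labels` relation, and the field names are placeholders for illustration, not OnCall's actual schema:

```python
# Hypothetical sketch -- the model and field names are assumptions, not OnCall's real schema.
from django.db.models import Count


def count_alert_groups_per_service(alert_group_model, organization, integration):
    """Count alert groups per `service_name` label value for one integration.

    Filtering on the label's organization (in addition to the key) is what allows the
    database to use a composite index such as (organization_id, key_name, value_name)
    on the label-association table, which is the optimization described above.
    """
    return (
        alert_group_model.objects.filter(
            channel=integration,
            labels__organization=organization,
            labels__key_name="service_name",
        )
        .values("labels__value_name")
        .annotate(alert_groups=Count("id", distinct=True))
    )
```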

Review thread on engine/apps/metrics_exporter/constants.py:

@@ -61,3 +66,6 @@ class RecalculateOrgMetricsDict(typing.TypedDict):

METRICS_ORGANIZATIONS_IDS = "metrics_organizations_ids"
METRICS_ORGANIZATIONS_IDS_CACHE_TIMEOUT = 3600 # 1 hour

SERVICE_LABEL = "service_name"
NO_SERVICE_VALUE = "No service"
Contributor commented:

Just curious, have we considered "Unnamed service" or "No name service" instead? ("No service" sounds like something is broken? :-))

@Ferril (Member, Author) commented:

It's like "No team", but for services :) And SLO uses "No service" as well, so it looks consistent
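
For illustration, a small runnable sketch of how these constants could be used to bucket alert groups by service, falling back to NO_SERVICE_VALUE when the label is absent (the helper here is hypothetical, not the PR's actual code):

```python
SERVICE_LABEL = "service_name"   # mirrors the constant added in this PR
NO_SERVICE_VALUE = "No service"  # default bucket when an alert group carries no service label


def service_bucket(alert_group_labels):
    """Hypothetical helper: pick the service bucket for an alert group's labels."""
    return alert_group_labels.get(SERVICE_LABEL, NO_SERVICE_VALUE)


counts = {}
for labels in ({"service_name": "checkout-api"}, {}, {"severity": "critical"}):
    bucket = service_bucket(labels)
    counts[bucket] = counts.get(bucket, 0) + 1

print(counts)  # {'checkout-api': 1, 'No service': 2}
```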

engine/apps/metrics_exporter/constants.py (outdated review thread, resolved)
engine/apps/metrics_exporter/metrics_cache_manager.py (outdated review thread, resolved)
engine/apps/metrics_exporter/helpers.py (review thread, resolved)

@matiasb (Contributor) left a review comment:

Makes sense to me. I would double-check the metrics cache schema transition before releasing (and maybe add some extra test(s) for the metrics cache update helpers involving multiple services?)
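
As a rough illustration of the kind of test suggested here, a self-contained pytest-style sketch; the `bump_service_count` helper is a stand-in defined inline, since the real cache-update helpers in engine/apps/metrics_exporter have different names and signatures:

```python
# Illustrative test sketch; the helper below stands in for OnCall's cache-update helpers.


def bump_service_count(entry, service_name, state):
    """Increment a per-service counter inside a cache entry using the new `services` layout."""
    services = entry.setdefault("services", {})
    counters = services.setdefault(service_name, {"firing": 0, "resolved": 0})
    counters[state] += 1


def test_cache_update_tracks_multiple_services():
    entry = {"integration_name": "Grafana Alerting", "services": {}}

    bump_service_count(entry, "checkout-api", "firing")
    bump_service_count(entry, "payments", "firing")
    bump_service_count(entry, "payments", "resolved")

    assert entry["services"]["checkout-api"] == {"firing": 1, "resolved": 0}
    assert entry["services"]["payments"] == {"firing": 1, "resolved": 1}
```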

@Ferril added this pull request to the merge queue Apr 29, 2024
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@Ferril added this pull request to the merge queue Apr 29, 2024
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@Ferril added this pull request to the merge queue Apr 29, 2024
Merged via the queue into dev with commit d1085b7 Apr 29, 2024
21 checks passed
@Ferril deleted the add-service-name-label-to-metrics branch April 29, 2024 09:52
github-merge-queue bot pushed a commit that referenced this pull request May 22, 2024
# What this PR does
Adds the `service_name` label to insight metrics.
NOTE: This is related to [this PR](#4227) and should be merged no sooner than two days after the next release (the current release version is 1.4.4), because we need to wait for the metrics cache to be updated for all organizations (it uses the new cache structure with `services`).

## Which issue(s) this PR closes
Related to grafana/oncall-private#2610

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
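
For context on how that follow-up could consume the new cache layout, a minimal sketch (not OnCall's actual collector code) that turns cached per-service counters into a Prometheus metric carrying a `service_name` label; the metric name, cache-entry shape, and counter keys are illustrative assumptions:

```python
# Illustrative sketch; metric name and cache-entry shape are assumptions.
from prometheus_client.core import GaugeMetricFamily


def alert_groups_metric_from_cache(cache_entries):
    """Build a gauge with integration and service_name labels from cached entries."""
    gauge = GaugeMetricFamily(
        "oncall_alert_groups_firing",
        "Firing alert groups per integration and service (illustrative)",
        labels=["integration", "service_name"],
    )
    for entry in cache_entries:
        for service_name, counters in entry.get("services", {}).items():
            gauge.add_metric([entry["integration_name"], service_name], counters["firing"])
    return gauge
```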