
Prepare insight metrics structure for adding service_name label #4227

Merged
Ferril merged 18 commits into dev from add-service-name-label-to-metrics on Apr 29, 2024

Conversation

@Ferril (Member) commented Apr 15, 2024

What this PR does

Prepares insight metrics for adding a `service_name` label.
This PR updates the metrics cache structure so that both the old and the new cache formats are supported.
The `service_name` label itself can be added in a follow-up PR once the metrics cache has been updated for all organizations.
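
For context, a minimal sketch of what "supporting both cache versions" can look like; the keys and counters below are illustrative assumptions, not the exact OnCall cache schema:

```python
# Illustrative only: keys and counters are assumptions, not OnCall's actual cache entries.

# Old-style cache entry for an integration's alert group metrics (no per-service breakdown):
old_entry = {
    "integration_name": "Grafana Alerting",
    "team_name": "No team",
    "firing": 3,
    "resolved": 10,
}

# New-style entry nests the counters under a `services` key, so that a `service_name`
# label can later be emitted per service value:
new_entry = {
    "integration_name": "Grafana Alerting",
    "team_name": "No team",
    "services": {
        "No service": {"firing": 2, "resolved": 7},
        "checkout-api": {"firing": 1, "resolved": 3},
    },
}


def total_firing(entry):
    """Read either cache version while both formats coexist."""
    if "services" in entry:
        return sum(service["firing"] for service in entry["services"].values())
    return entry["firing"]


print(total_firing(old_entry), total_firing(new_entry))  # 3 3
```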

Which issue(s) this PR closes

https://github.com/grafana/oncall-private/issues/2610

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)
  • Added the relevant release notes label (see labels prefixed w/ release:). These labels dictate how your PR will
    show up in the autogenerated release notes.

@Ferril added the `pr:no public docs` label (no public documentation updates required) Apr 19, 2024
@Ferril added the `release:ignore` label (PR will not be added to release notes) Apr 22, 2024
@Ferril changed the title from "Draft for adding service_name label to insight metrics" to "Prepare insight metrics structure for adding service_name label" Apr 22, 2024
@Ferril marked this pull request as ready for review Apr 22, 2024 08:23
@Ferril requested a review from a team as a code owner Apr 22, 2024 08:23
engine/apps/metrics_exporter/helpers.py (two outdated review threads, resolved)

Review thread on engine/apps/metrics_exporter/helpers.py:

all_response_time_seconds = [int(response_time.total_seconds()) for response_time in all_response_time]

metric_alert_group_response_time[integration.id] = {
# count alert groups with `service_name` label, grouped by label value
Contributor commented:

Have we tried these queries on production db? Do we need to apply any optimizations?

@Ferril (Member, Author) commented Apr 23, 2024:

Yes, the first call for response time is slow (>20s); subsequent calls are faster (~2s). I'm checking how to optimize this.
UPD: to speed up the queries that use the `service_name` label I added a filter by organization, which makes them use the label table's composite index.

Contributor commented:

+1 to optimize as possible (using the index improved things?). In any case, this shouldn't affect the response time in the scrape endpoint right? (there we are still getting the data from cache and returning it, correct?)

@Ferril (Member, Author) commented:

> +1 to optimize as possible (using the index improved things?)

The slowest call in my tests is ~10s now. It uses the read-only DB, so I don't think this will be a problem 🤔

> In any case, this shouldn't affect the response time in the scrape endpoint right? (there we are still getting the data from cache and returning it, correct?)

Correct, we get all data from the cache and don't do any calculations there
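
Following up on the composite-index point above, here is a hypothetical Django ORM sketch of the shape such a query could take; `alert_group_model`, the `labels` relation, and the field names are placeholders for illustration, not OnCall's actual schema:

```python
# Hypothetical sketch -- the model and field names are assumptions, not OnCall's real schema.
from django.db.models import Count


def count_alert_groups_per_service(alert_group_model, organization, integration):
    """Count alert groups per `service_name` label value for one integration.

    Filtering on the label's organization (in addition to the key) is what allows the
    database to use a composite index such as (organization_id, key_name, value_name)
    on the label-association table, which is the optimization described above.
    """
    return (
        alert_group_model.objects.filter(
            channel=integration,
            labels__organization=organization,
            labels__key_name="service_name",
        )
        .values("labels__value_name")
        .annotate(alert_groups=Count("id", distinct=True))
    )
```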

Review thread on engine/apps/metrics_exporter/constants.py:

@@ -61,3 +66,6 @@ class RecalculateOrgMetricsDict(typing.TypedDict):

METRICS_ORGANIZATIONS_IDS = "metrics_organizations_ids"
METRICS_ORGANIZATIONS_IDS_CACHE_TIMEOUT = 3600 # 1 hour

SERVICE_LABEL = "service_name"
NO_SERVICE_VALUE = "No service"
Contributor commented:

Just curious, have we considered "Unnamed service" or "No name service" instead? ("No service" sounds like something is broken? :-))

@Ferril (Member, Author) commented:

It's like "No team", but for services :) And SLO uses "No service" as well, so it looks consistent
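
For illustration, a small runnable sketch of how these constants could be used to bucket alert groups by service, falling back to NO_SERVICE_VALUE when the label is absent (the helper here is hypothetical, not the PR's actual code):

```python
SERVICE_LABEL = "service_name"   # mirrors the constant added in this PR
NO_SERVICE_VALUE = "No service"  # default bucket when an alert group carries no service label


def service_bucket(alert_group_labels):
    """Hypothetical helper: pick the service bucket for an alert group's labels."""
    return alert_group_labels.get(SERVICE_LABEL, NO_SERVICE_VALUE)


counts = {}
for labels in ({"service_name": "checkout-api"}, {}, {"severity": "critical"}):
    bucket = service_bucket(labels)
    counts[bucket] = counts.get(bucket, 0) + 1

print(counts)  # {'checkout-api': 1, 'No service': 2}
```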

engine/apps/metrics_exporter/constants.py (outdated review thread, resolved)
engine/apps/metrics_exporter/metrics_cache_manager.py (outdated review thread, resolved)
engine/apps/metrics_exporter/helpers.py (review thread, resolved)

@matiasb (Contributor) left a review comment:

Makes sense to me. I would double-check the metrics cache schema transition before releasing (and maybe add some extra test(s) for the metrics cache update helpers involving multiple services?)
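
As a rough illustration of the kind of test suggested here, a self-contained pytest-style sketch; the `bump_service_count` helper is a stand-in defined inline, since the real cache-update helpers in engine/apps/metrics_exporter have different names and signatures:

```python
# Illustrative test sketch; the helper below stands in for OnCall's cache-update helpers.


def bump_service_count(entry, service_name, state):
    """Increment a per-service counter inside a cache entry using the new `services` layout."""
    services = entry.setdefault("services", {})
    counters = services.setdefault(service_name, {"firing": 0, "resolved": 0})
    counters[state] += 1


def test_cache_update_tracks_multiple_services():
    entry = {"integration_name": "Grafana Alerting", "services": {}}

    bump_service_count(entry, "checkout-api", "firing")
    bump_service_count(entry, "payments", "firing")
    bump_service_count(entry, "payments", "resolved")

    assert entry["services"]["checkout-api"] == {"firing": 1, "resolved": 0}
    assert entry["services"]["payments"] == {"firing": 1, "resolved": 1}
```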

@Ferril added this pull request to the merge queue Apr 29, 2024
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@Ferril added this pull request to the merge queue Apr 29, 2024
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@Ferril added this pull request to the merge queue Apr 29, 2024
Merged via the queue into dev with commit d1085b7 Apr 29, 2024
21 checks passed
@Ferril deleted the add-service-name-label-to-metrics branch April 29, 2024 09:52
github-merge-queue bot pushed a commit that referenced this pull request May 22, 2024
# What this PR does
Adds the `service_name` label to insight metrics.
NOTE: This is related to [this PR](#4227) and should be merged no sooner than two days after the next release (the current release version is 1.4.4), because we need to wait for the metrics cache to be updated for all organizations (it uses the new cache structure with `services`).

## Which issue(s) this PR closes
Related to grafana/oncall-private#2610

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
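
For context on how that follow-up could consume the new cache layout, a minimal sketch (not OnCall's actual collector code) that turns cached per-service counters into a Prometheus metric carrying a `service_name` label; the metric name, cache-entry shape, and counter keys are illustrative assumptions:

```python
# Illustrative sketch; metric name and cache-entry shape are assumptions.
from prometheus_client.core import GaugeMetricFamily


def alert_groups_metric_from_cache(cache_entries):
    """Build a gauge with integration and service_name labels from cached entries."""
    gauge = GaugeMetricFamily(
        "oncall_alert_groups_firing",
        "Firing alert groups per integration and service (illustrative)",
        labels=["integration", "service_name"],
    )
    for entry in cache_entries:
        for service_name, counters in entry.get("services", {}).items():
            gauge.add_metric([entry["integration_name"], service_name], counters["firing"])
    return gauge
```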