Add summary stats and metrics for stats cache #9536

salvacorts · 2023-05-26T14:32:02Z

What this PR does / why we need it:
When a query finishes, we return (and log) the following stats:

Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent   0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms

In addition to that, we log the following in metrics.go:

level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms

With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache.

Here's an example of the new stats being returned and printed:

...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound   129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...

And the new stats from metrics.go

... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...

Special notes for your reviewer:

Blocked by Stats cache can be configured independently #9535
Note the newstats.GetOrCreateContext func. It's used inside the query.Exec method so we don't overwrite the stats added in the stats middleware.

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
CHANGELOG.md updated
Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

**What this PR does / why we need it**: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. **Which issue(s) this PR fixes**: This is needed for: - #9537 - #9536 **Special notes for your reviewer**: - This PR also refactors all the tripperwares in rountrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in rountrip.go is a requirement for #9536 so the stats summary can distinguish before the stats cache and the results cache. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**: When a query finishes, we return (and log) the following stats: ```go Cache.Chunk.Requests 0 Cache.Chunk.EntriesRequested 0 Cache.Chunk.EntriesFound 0 Cache.Chunk.EntriesStored 0 Cache.Chunk.BytesSent 0 B Cache.Chunk.BytesReceived 0 B Cache.Chunk.DownloadTime 0s Cache.Index.Requests 0 Cache.Index.EntriesRequested 0 Cache.Index.EntriesFound 0 Cache.Index.EntriesStored 0 Cache.Index.BytesSent 0 B Cache.Index.BytesReceived 0 B Cache.Index.DownloadTime 0s Cache.Result.Requests 13 Cache.Result.EntriesRequested 13 Cache.Result.EntriesFound 13 Cache.Result.EntriesStored 0 Cache.Result.BytesSent 0 B Cache.Result.BytesReceived 2.5 kB Cache.Result.DownloadTime 4.600266ms ``` In addition to that, we log the following in metrics.go: ``` level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester |promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes _fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms ``` With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache. Here's an example of the new stats being returned and printed: ```go ... Cache.StatsResult.Requests 180 Cache.StatsResult.EntriesRequested 129 Cache.StatsResult.EntriesFound 129 Cache.StatsResult.EntriesStored 51 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 75 kB ... ``` And the new stats from metrics.go ``` ... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti me=156.864429ms ... ``` **Special notes for your reviewer**: - Blocked by #9535 - Note the new`stats.GetOrCreateContext` func. It's used inside the `query.Exec` method so we don't overwrite the stats added in the stats middleware. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. **Which issue(s) this PR fixes**: This is needed for: - #9537 - #9536 **Special notes for your reviewer**: - This PR also refactors all the tripperwares in rountrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in rountrip.go is a requirement for #9536 so the stats summary can distinguish before the stats cache and the results cache. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**: When a query finishes, we return (and log) the following stats: ```go Cache.Chunk.Requests 0 Cache.Chunk.EntriesRequested 0 Cache.Chunk.EntriesFound 0 Cache.Chunk.EntriesStored 0 Cache.Chunk.BytesSent 0 B Cache.Chunk.BytesReceived 0 B Cache.Chunk.DownloadTime 0s Cache.Index.Requests 0 Cache.Index.EntriesRequested 0 Cache.Index.EntriesFound 0 Cache.Index.EntriesStored 0 Cache.Index.BytesSent 0 B Cache.Index.BytesReceived 0 B Cache.Index.DownloadTime 0s Cache.Result.Requests 13 Cache.Result.EntriesRequested 13 Cache.Result.EntriesFound 13 Cache.Result.EntriesStored 0 Cache.Result.BytesSent 0 B Cache.Result.BytesReceived 2.5 kB Cache.Result.DownloadTime 4.600266ms ``` In addition to that, we log the following in metrics.go: ``` level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester |promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes _fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms ``` With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache. Here's an example of the new stats being returned and printed: ```go ... Cache.StatsResult.Requests 180 Cache.StatsResult.EntriesRequested 129 Cache.StatsResult.EntriesFound 129 Cache.StatsResult.EntriesStored 51 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 75 kB ... ``` And the new stats from metrics.go ``` ... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti me=156.864429ms ... ``` **Special notes for your reviewer**: - Blocked by #9535 - Note the new`stats.GetOrCreateContext` func. It's used inside the `query.Exec` method so we don't overwrite the stats added in the stats middleware. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

Adds changes from: - #9535 - #9536 - #9537 - #9529 - #9552 So we can use k150 with newer config.

This reverts commit af287ac.

This reverts commit af287ac. There is a bug in this PR that inflates the stats returned for the query since we reuse the stats ctx in the query execution engine.

This reverts commit af287ac. Note that I had to cherrypick #9601 so the CI would pass. That made the diff for this PR huge. --------- Co-authored-by: Trevor Whitney <trevorjwhitney@gmail.com>

**What this PR does / why we need it**: When a query finishes, we return (and log) the following stats: ```go Cache.Chunk.Requests 0 Cache.Chunk.EntriesRequested 0 Cache.Chunk.EntriesFound 0 Cache.Chunk.EntriesStored 0 Cache.Chunk.BytesSent 0 B Cache.Chunk.BytesReceived 0 B Cache.Chunk.DownloadTime 0s Cache.Index.Requests 0 Cache.Index.EntriesRequested 0 Cache.Index.EntriesFound 0 Cache.Index.EntriesStored 0 Cache.Index.BytesSent 0 B Cache.Index.BytesReceived 0 B Cache.Index.DownloadTime 0s Cache.Result.Requests 13 Cache.Result.EntriesRequested 13 Cache.Result.EntriesFound 13 Cache.Result.EntriesStored 0 Cache.Result.BytesSent 0 B Cache.Result.BytesReceived 2.5 kB Cache.Result.DownloadTime 4.600266ms ``` In addition to that, we log the following in metrics.go: ``` level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester |promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes _fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms ``` With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache. Here's an example of the new stats being returned and printed: ```go ... Cache.StatsResult.Requests 180 Cache.StatsResult.EntriesRequested 129 Cache.StatsResult.EntriesFound 129 Cache.StatsResult.EntriesStored 51 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 75 kB ... ``` And the new stats from metrics.go ``` ... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti me=156.864429ms ... ``` **Special notes for your reviewer**: - Blocked by #9535 - Note the new`stats.GetOrCreateContext` func. It's used inside the `query.Exec` method so we don't overwrite the stats added in the stats middleware. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**: In #9536, we added cache stats for index stats requests. That PR had a bug that inflated the query stats due to reusing the stats context in the query engine. Therefore, we had to revert the PR at #9721. This PR brings back the changes from #9536 but fixes the inflated starts by no longer reusing the same context in the query engine, but rather creating a new one for the shard resolver. I tested it on a dev cluster and seems to be working fine. here's the output for the same query: **Stats with the bug from #9536**: ``` ... Cache.StatsResult.Requests 980 Cache.StatsResult.EntriesRequested 490 Cache.StatsResult.EntriesFound 0 Cache.StatsResult.EntriesStored 490 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 0 B ... Summary.BytesProcessedPerSecond 43 GB Summary.LinesProcessedPerSecond 93305142 Summary.TotalBytesProcessed 945 GB Summary.TotalLinesProcessed 2059694183 ``` **Stats from _main_** ``` ... Summary.BytesProcessedPerSecond 1.6 GB Summary.LinesProcessedPerSecond 3403718 Summary.TotalBytesProcessed 95 GB Summary.TotalLinesProcessed 207971404 ``` **Stats with fix in this PR** ``` .. Cache.StatsResult.Requests 132 Cache.StatsResult.EntriesRequested 66 Cache.StatsResult.EntriesFound 0 Cache.StatsResult.EntriesStored 66 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 0 B ... Summary.BytesProcessedPerSecond 4.3 GB Summary.LinesProcessedPerSecond 9468900 Summary.TotalBytesProcessed 95 GB Summary.TotalLinesProcessed 207793816 ``` As can be seen, with the changes in this PR, the summary stats are no longer inflated. **Which issue(s) this PR fixes**: Fixes #9536 **Special notes for your reviewer**: I think it's ok to skip reviewing the changes from the commit cherry-picking the changes from #9536 **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

pull-request-size bot added the size/M label May 26, 2023

salvacorts mentioned this pull request May 29, 2023

Stats cache can be configured independently #9535

Merged

6 tasks

owen-d approved these changes Jun 6, 2023

View reviewed changes

Base automatically changed from salvacorts/stats-cache-summary-stats-and-metrics to main June 7, 2023 09:00

pull-request-size bot added size/XL and removed size/M labels Jun 7, 2023

salvacorts marked this pull request as ready for review June 7, 2023 09:02

salvacorts requested review from a team and JStickler as code owners June 7, 2023 09:02

Add stats and metrics for index stats cache

4e0f17f

salvacorts force-pushed the salvacorts/summary-stats-and-metrics-for-stats-cache branch from e910cc8 to 4e0f17f Compare June 7, 2023 09:13

pull-request-size bot added size/M and removed size/XL labels Jun 7, 2023

salvacorts removed the request for review from JStickler June 7, 2023 09:25

salvacorts merged commit af287ac into main Jun 7, 2023

salvacorts deleted the salvacorts/summary-stats-and-metrics-for-stats-cache branch June 7, 2023 09:25

salvacorts mentioned this pull request Jun 13, 2023

k150 with stats cache changes #9680

Merged

salvacorts added a commit that referenced this pull request Jun 13, 2023

k150 with stats cache changes (#9680)

ff001a3

Adds changes from: - #9535 - #9536 - #9537 - #9529 - #9552 So we can use k150 with newer config.

salvacorts added a commit that referenced this pull request Jun 15, 2023

Revert "Add summary stats and metrics for stats cache (#9536)"

84ae595

This reverts commit af287ac.

salvacorts added a commit that referenced this pull request Jun 15, 2023

Revert "Add summary stats and metrics for stats cache (#9536)"

eb4a600

This reverts commit af287ac.

salvacorts added a commit that referenced this pull request Jun 16, 2023

Revert "Add summary stats and metrics for stats cache (#9536)"

d7dee2f

This reverts commit af287ac.

kavirajk pushed a commit that referenced this pull request Jun 16, 2023

Revert "Add summary stats and metrics for stats cache (#9536)" (#9713)

c206d68

This reverts commit af287ac.

salvacorts mentioned this pull request Jun 28, 2023

Add back cache stats for index stats requests #9816

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add summary stats and metrics for stats cache #9536

Add summary stats and metrics for stats cache #9536

Uh oh!

salvacorts commented May 26, 2023 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add summary stats and metrics for stats cache #9536

Add summary stats and metrics for stats cache #9536

Uh oh!

Conversation

salvacorts commented May 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

salvacorts commented May 26, 2023 •

edited

Loading