-
Notifications
You must be signed in to change notification settings - Fork 3.9k
Add summary stats and metrics for stats cache #9536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
salvacorts
merged 1 commit into
main
from
salvacorts/summary-stats-and-metrics-for-stats-cache
Jun 7, 2023
Merged
Add summary stats and metrics for stats cache #9536
salvacorts
merged 1 commit into
main
from
salvacorts/summary-stats-and-metrics-for-stats-cache
Jun 7, 2023
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
6 tasks
owen-d
approved these changes
Jun 6, 2023
Base automatically changed from
salvacorts/stats-cache-summary-stats-and-metrics
to
main
June 7, 2023 09:00
salvacorts
added a commit
that referenced
this pull request
Jun 7, 2023
**What this PR does / why we need it**: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. **Which issue(s) this PR fixes**: This is needed for: - #9537 - #9536 **Special notes for your reviewer**: - This PR also refactors all the tripperwares in rountrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in rountrip.go is a requirement for #9536 so the stats summary can distinguish before the stats cache and the results cache. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
e910cc8 to
4e0f17f
Compare
salvacorts
added a commit
that referenced
this pull request
Jun 12, 2023
**What this PR does / why we need it**: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. **Which issue(s) this PR fixes**: This is needed for: - #9537 - #9536 **Special notes for your reviewer**: - This PR also refactors all the tripperwares in rountrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in rountrip.go is a requirement for #9536 so the stats summary can distinguish before the stats cache and the results cache. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
salvacorts
added a commit
that referenced
this pull request
Jun 12, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests 0
Cache.Chunk.EntriesRequested 0
Cache.Chunk.EntriesFound 0
Cache.Chunk.EntriesStored 0
Cache.Chunk.BytesSent 0 B
Cache.Chunk.BytesReceived 0 B
Cache.Chunk.DownloadTime 0s
Cache.Index.Requests 0
Cache.Index.EntriesRequested 0
Cache.Index.EntriesFound 0
Cache.Index.EntriesStored 0
Cache.Index.BytesSent 0 B
Cache.Index.BytesReceived 0 B
Cache.Index.DownloadTime 0s
Cache.Result.Requests 13
Cache.Result.EntriesRequested 13
Cache.Result.EntriesFound 13
Cache.Result.EntriesStored 0
Cache.Result.BytesSent 0 B
Cache.Result.BytesReceived 2.5 kB
Cache.Result.DownloadTime 4.600266ms
```
In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```
With the goal of being able to better monitor how the stats cache is
performing; this PR adds stats for the index stats cache, similarly to
how it's done for the results cache.
Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests 180
Cache.StatsResult.EntriesRequested 129
Cache.StatsResult.EntriesFound 129
Cache.StatsResult.EntriesStored 51
Cache.StatsResult.BytesSent 0 B
Cache.StatsResult.BytesReceived 75 kB
...
```
And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...
```
**Special notes for your reviewer**:
- Blocked by #9535
- Note the new`stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.
**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts
added a commit
that referenced
this pull request
Jun 13, 2023
**What this PR does / why we need it**: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. **Which issue(s) this PR fixes**: This is needed for: - #9537 - #9536 **Special notes for your reviewer**: - This PR also refactors all the tripperwares in rountrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in rountrip.go is a requirement for #9536 so the stats summary can distinguish before the stats cache and the results cache. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
salvacorts
added a commit
that referenced
this pull request
Jun 13, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests 0
Cache.Chunk.EntriesRequested 0
Cache.Chunk.EntriesFound 0
Cache.Chunk.EntriesStored 0
Cache.Chunk.BytesSent 0 B
Cache.Chunk.BytesReceived 0 B
Cache.Chunk.DownloadTime 0s
Cache.Index.Requests 0
Cache.Index.EntriesRequested 0
Cache.Index.EntriesFound 0
Cache.Index.EntriesStored 0
Cache.Index.BytesSent 0 B
Cache.Index.BytesReceived 0 B
Cache.Index.DownloadTime 0s
Cache.Result.Requests 13
Cache.Result.EntriesRequested 13
Cache.Result.EntriesFound 13
Cache.Result.EntriesStored 0
Cache.Result.BytesSent 0 B
Cache.Result.BytesReceived 2.5 kB
Cache.Result.DownloadTime 4.600266ms
```
In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```
With the goal of being able to better monitor how the stats cache is
performing; this PR adds stats for the index stats cache, similarly to
how it's done for the results cache.
Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests 180
Cache.StatsResult.EntriesRequested 129
Cache.StatsResult.EntriesFound 129
Cache.StatsResult.EntriesStored 51
Cache.StatsResult.BytesSent 0 B
Cache.StatsResult.BytesReceived 75 kB
...
```
And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...
```
**Special notes for your reviewer**:
- Blocked by #9535
- Note the new`stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.
**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts
added a commit
that referenced
this pull request
Jun 13, 2023
salvacorts
added a commit
that referenced
this pull request
Jun 15, 2023
This reverts commit af287ac.
salvacorts
added a commit
that referenced
this pull request
Jun 15, 2023
This reverts commit af287ac.
salvacorts
added a commit
that referenced
this pull request
Jun 16, 2023
This reverts commit af287ac.
kavirajk
pushed a commit
that referenced
this pull request
Jun 16, 2023
salvacorts
added a commit
that referenced
this pull request
Jun 16, 2023
salvacorts
added a commit
that referenced
this pull request
Jun 16, 2023
salvacorts
added a commit
that referenced
this pull request
Jun 28, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests 0
Cache.Chunk.EntriesRequested 0
Cache.Chunk.EntriesFound 0
Cache.Chunk.EntriesStored 0
Cache.Chunk.BytesSent 0 B
Cache.Chunk.BytesReceived 0 B
Cache.Chunk.DownloadTime 0s
Cache.Index.Requests 0
Cache.Index.EntriesRequested 0
Cache.Index.EntriesFound 0
Cache.Index.EntriesStored 0
Cache.Index.BytesSent 0 B
Cache.Index.BytesReceived 0 B
Cache.Index.DownloadTime 0s
Cache.Result.Requests 13
Cache.Result.EntriesRequested 13
Cache.Result.EntriesFound 13
Cache.Result.EntriesStored 0
Cache.Result.BytesSent 0 B
Cache.Result.BytesReceived 2.5 kB
Cache.Result.DownloadTime 4.600266ms
```
In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```
With the goal of being able to better monitor how the stats cache is
performing; this PR adds stats for the index stats cache, similarly to
how it's done for the results cache.
Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests 180
Cache.StatsResult.EntriesRequested 129
Cache.StatsResult.EntriesFound 129
Cache.StatsResult.EntriesStored 51
Cache.StatsResult.BytesSent 0 B
Cache.StatsResult.BytesReceived 75 kB
...
```
And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...
```
**Special notes for your reviewer**:
- Blocked by #9535
- Note the new`stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.
**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
7 tasks
salvacorts
added a commit
that referenced
this pull request
Jun 28, 2023
**What this PR does / why we need it**: In #9536, we added cache stats for index stats requests. That PR had a bug that inflated the query stats due to reusing the stats context in the query engine. Therefore, we had to revert the PR at #9721. This PR brings back the changes from #9536 but fixes the inflated starts by no longer reusing the same context in the query engine, but rather creating a new one for the shard resolver. I tested it on a dev cluster and seems to be working fine. here's the output for the same query: **Stats with the bug from #9536**: ``` ... Cache.StatsResult.Requests 980 Cache.StatsResult.EntriesRequested 490 Cache.StatsResult.EntriesFound 0 Cache.StatsResult.EntriesStored 490 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 0 B ... Summary.BytesProcessedPerSecond 43 GB Summary.LinesProcessedPerSecond 93305142 Summary.TotalBytesProcessed 945 GB Summary.TotalLinesProcessed 2059694183 ``` **Stats from _main_** ``` ... Summary.BytesProcessedPerSecond 1.6 GB Summary.LinesProcessedPerSecond 3403718 Summary.TotalBytesProcessed 95 GB Summary.TotalLinesProcessed 207971404 ``` **Stats with fix in this PR** ``` .. Cache.StatsResult.Requests 132 Cache.StatsResult.EntriesRequested 66 Cache.StatsResult.EntriesFound 0 Cache.StatsResult.EntriesStored 66 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 0 B ... Summary.BytesProcessedPerSecond 4.3 GB Summary.LinesProcessedPerSecond 9468900 Summary.TotalBytesProcessed 95 GB Summary.TotalLinesProcessed 207793816 ``` As can be seen, with the changes in this PR, the summary stats are no longer inflated. **Which issue(s) this PR fixes**: Fixes #9536 **Special notes for your reviewer**: I think it's ok to skip reviewing the changes from the commit cherry-picking the changes from #9536 **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What this PR does / why we need it:
When a query finishes, we return (and log) the following stats:
In addition to that, we log the following in metrics.go:
With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache.
Here's an example of the new stats being returned and printed:
And the new stats from metrics.go
Special notes for your reviewer:
stats.GetOrCreateContextfunc. It's used inside thequery.Execmethod so we don't overwrite the stats added in the stats middleware.Checklist
CONTRIBUTING.mdguide (required)CHANGELOG.mdupdateddocs/sources/upgrading/_index.mdproduction/helm/loki/Chart.yamland updateproduction/helm/loki/CHANGELOG.mdandproduction/helm/loki/README.md. Example PR