Skip to content

Conversation

@salvacorts
Copy link
Contributor

@salvacorts salvacorts commented May 26, 2023

What this PR does / why we need it:
When a query finishes, we return (and log) the following stats:

Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent   0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms

In addition to that, we log the following in metrics.go:

level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms

With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache.

Here's an example of the new stats being returned and printed:

...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound   129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...

And the new stats from metrics.go

... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

Base automatically changed from salvacorts/stats-cache-summary-stats-and-metrics to main June 7, 2023 09:00
salvacorts added a commit that referenced this pull request Jun 7, 2023
**What this PR does / why we need it**:

Before this PR, the index stats cache would use the same config as the
query results cache. This was a limitation since:

1. We would not be able to point to a different cache for storing the
index stats if needed.
2. We would not be able to add specific settings for this cache, without
adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently
configurable. Note that if it's not configured, it will try to use the
results cache settings.

**Which issue(s) this PR fixes**:
This is needed for:
- #9537
- #9536

**Special notes for your reviewer**:

- This PR also refactors all the tripperwares in rountrip.go to reuse
the same stats tripperware instead of each one creating their own.
- Configuring a new cache in rountrip.go is a requirement for
#9536 so the stats summary can
distinguish before the stats cache and the results cache.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
@pull-request-size pull-request-size bot added size/XL and removed size/M labels Jun 7, 2023
@salvacorts salvacorts marked this pull request as ready for review June 7, 2023 09:02
@salvacorts salvacorts requested review from a team and JStickler as code owners June 7, 2023 09:02
@salvacorts salvacorts force-pushed the salvacorts/summary-stats-and-metrics-for-stats-cache branch from e910cc8 to 4e0f17f Compare June 7, 2023 09:13
@pull-request-size pull-request-size bot added size/M and removed size/XL labels Jun 7, 2023
@salvacorts salvacorts removed the request for review from JStickler June 7, 2023 09:25
@salvacorts salvacorts merged commit af287ac into main Jun 7, 2023
@salvacorts salvacorts deleted the salvacorts/summary-stats-and-metrics-for-stats-cache branch June 7, 2023 09:25
salvacorts added a commit that referenced this pull request Jun 12, 2023
**What this PR does / why we need it**:

Before this PR, the index stats cache would use the same config as the
query results cache. This was a limitation since:

1. We would not be able to point to a different cache for storing the
index stats if needed.
2. We would not be able to add specific settings for this cache, without
adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently
configurable. Note that if it's not configured, it will try to use the
results cache settings.

**Which issue(s) this PR fixes**:
This is needed for:
- #9537
- #9536

**Special notes for your reviewer**:

- This PR also refactors all the tripperwares in rountrip.go to reuse
the same stats tripperware instead of each one creating their own.
- Configuring a new cache in rountrip.go is a requirement for
#9536 so the stats summary can
distinguish before the stats cache and the results cache.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 12, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent   0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms
```

In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

With the goal of being able to better monitor how the stats cache is
performing; this PR adds stats for the index stats cache, similarly to
how it's done for the results cache.

Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound   129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...
```

And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...
```

**Special notes for your reviewer**:
- Blocked by #9535
- Note the new`stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 13, 2023
**What this PR does / why we need it**:

Before this PR, the index stats cache would use the same config as the
query results cache. This was a limitation since:

1. We would not be able to point to a different cache for storing the
index stats if needed.
2. We would not be able to add specific settings for this cache, without
adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently
configurable. Note that if it's not configured, it will try to use the
results cache settings.

**Which issue(s) this PR fixes**:
This is needed for:
- #9537
- #9536

**Special notes for your reviewer**:

- This PR also refactors all the tripperwares in rountrip.go to reuse
the same stats tripperware instead of each one creating their own.
- Configuring a new cache in rountrip.go is a requirement for
#9536 so the stats summary can
distinguish before the stats cache and the results cache.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 13, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent   0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms
```

In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

With the goal of being able to better monitor how the stats cache is
performing; this PR adds stats for the index stats cache, similarly to
how it's done for the results cache.

Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound   129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...
```

And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...
```

**Special notes for your reviewer**:
- Blocked by #9535
- Note the new`stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 13, 2023
Adds changes from:

- #9535
- #9536
- #9537
- #9529
- #9552

So we can use k150 with newer config.
salvacorts added a commit that referenced this pull request Jun 15, 2023
salvacorts added a commit that referenced this pull request Jun 15, 2023
salvacorts added a commit that referenced this pull request Jun 16, 2023
kavirajk pushed a commit that referenced this pull request Jun 16, 2023
salvacorts added a commit that referenced this pull request Jun 16, 2023
This reverts commit af287ac.

There is a bug in this PR that inflates the stats returned for the query
since we reuse the stats ctx in the query execution engine.
salvacorts added a commit that referenced this pull request Jun 16, 2023
This reverts commit
af287ac.

Note that I had to cherrypick #9601
so the CI would pass. That made the diff for this PR huge.

---------

Co-authored-by: Trevor Whitney <trevorjwhitney@gmail.com>
salvacorts added a commit that referenced this pull request Jun 28, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent   0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms
```

In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester
|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 lim
it=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes
_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

With the goal of being able to better monitor how the stats cache is
performing; this PR adds stats for the index stats cache, similarly to
how it's done for the results cache.

Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound   129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...
```

And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_ti
me=156.864429ms ...
```

**Special notes for your reviewer**:
- Blocked by #9535
- Note the new`stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 28, 2023
**What this PR does / why we need it**:

In #9536, we added cache stats for
index stats requests. That PR had a bug that inflated the query stats
due to reusing the stats context in the query engine. Therefore, we had
to revert the PR at #9721.

This PR brings back the changes from
#9536 but fixes the inflated starts
by no longer reusing the same context in the query engine, but rather
creating a new one for the shard resolver.

I tested it on a dev cluster and seems to be working fine. here's the
output for the same query:

**Stats with the bug from #9536**:
```
...
Cache.StatsResult.Requests               980
Cache.StatsResult.EntriesRequested       490
Cache.StatsResult.EntriesFound   0
Cache.StatsResult.EntriesStored          490
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          0 B
...
Summary.BytesProcessedPerSecond          43 GB
Summary.LinesProcessedPerSecond          93305142
Summary.TotalBytesProcessed              945 GB
Summary.TotalLinesProcessed              2059694183
```

**Stats from _main_**
```
...
Summary.BytesProcessedPerSecond          1.6 GB
Summary.LinesProcessedPerSecond          3403718
Summary.TotalBytesProcessed              95 GB
Summary.TotalLinesProcessed              207971404
```

**Stats with fix in this PR**
```
..
Cache.StatsResult.Requests               132
Cache.StatsResult.EntriesRequested       66
Cache.StatsResult.EntriesFound   0
Cache.StatsResult.EntriesStored          66
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          0 B
...
Summary.BytesProcessedPerSecond          4.3 GB
Summary.LinesProcessedPerSecond          9468900
Summary.TotalBytesProcessed              95 GB
Summary.TotalLinesProcessed              207793816
```

As can be seen, with the changes in this PR, the summary stats are no
longer inflated.


**Which issue(s) this PR fixes**:
Fixes #9536

**Special notes for your reviewer**:

I think it's ok to skip reviewing the changes from the commit
cherry-picking the changes from
#9536

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants