
Invalidate caches on delete #5661

Merged

Conversation

@MasslessParticle (Contributor) commented Mar 17, 2022

This PR does two things:

  1. Generates and stores a generation number for a tenant when deletes are added, processed, or removed.
  2. Creates an implementation of CacheGenNumLoader that is passed to our metrics Tripperware

As a result, when deletes are added or changed for a user, the caches are queried with an updated generation number on the keys, so the user sees filtered/unfiltered results as deletes are created and processed.
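For illustration only, here is a minimal sketch of the general idea (not Loki's actual code; all names are invented): the tenant's generation number is folded into every cache key, so bumping it makes old entries unreachable without explicitly deleting anything.

```go
package main

import "fmt"

// Invented names for illustration: the tenant's current generation number is
// folded into every cache key, so bumping the generation makes old entries
// unreachable without explicitly deleting them.
var genNumbers = map[string]string{"tenant-a": "3"}

// cacheKey builds the effective cache key for one tenant's query result.
func cacheKey(tenantID, baseKey string) string {
	return fmt.Sprintf("%s/%s/%s", genNumbers[tenantID], tenantID, baseKey)
}

func main() {
	fmt.Println(cacheKey("tenant-a", `sum(rate({app="foo"}[5m]))/2022-03-17`))
	// A delete being added/processed/removed bumps the generation, which
	// changes every key, so previously cached results are no longer hit.
	genNumbers["tenant-a"] = "4"
	fmt.Println(cacheKey("tenant-a", `sum(rate({app="foo"}[5m]))/2022-03-17`))
}
```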

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@MasslessParticle requested a review from a team as a code owner, March 17, 2022 15:36
@MichelHollands (Contributor) left a comment

LGTM. A few smaller comments.

pkg/loki/loki.go (outdated; resolved)
@sandeepsukhani (Contributor) left a comment

I have added some thoughts. Please let me know what you think about those.

Comment on lines 111 to 117
func (l *GenNumberLoader) GetResultsCacheGenNumber(tenantIDs []string) string {
	return l.getCacheGenNumbersPerTenants(tenantIDs)
}

func (l *GenNumberLoader) GetStoreCacheGenNumber(tenantIDs []string) string {
	return l.getCacheGenNumbersPerTenants(tenantIDs)
}
Contributor:
The store and results cache gen numbers must be different and they must be incremented at different times.

The store cache gen would be incremented when a delete request is marked as processed, because we are changing the data in storage and the old data could still be in the cache.

The results cache gen would be incremented when a delete request is created/cancelled, because we are going to start/stop doing query-time filtering for the affected streams.
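
For contrast with the snippet above, where both getters return the same shared value, here is a hedged sketch of what split per-tenant generations might look like; the types and fields are illustrative, not Loki's actual API.

```go
package main

import "fmt"

// Illustrative only: a loader that tracks the results-cache and store-cache
// generations separately per tenant, unlike the snippet above where both
// getters delegate to the same shared number.
type tenantGens struct {
	results string // changes when a delete request is created or cancelled
	store   string // changes when a delete request is marked as processed
}

type splitGenLoader struct {
	gens map[string]tenantGens
}

func (l *splitGenLoader) GetResultsCacheGenNumber(tenantIDs []string) string {
	return l.join(tenantIDs, func(g tenantGens) string { return g.results })
}

func (l *splitGenLoader) GetStoreCacheGenNumber(tenantIDs []string) string {
	return l.join(tenantIDs, func(g tenantGens) string { return g.store })
}

// join concatenates the chosen generation number of every tenant, so a change
// for any tenant involved in the query changes the resulting cache key.
func (l *splitGenLoader) join(tenantIDs []string, pick func(tenantGens) string) string {
	out := ""
	for i, id := range tenantIDs {
		if i > 0 {
			out += ","
		}
		out += pick(l.gens[id])
	}
	return out
}

func main() {
	l := &splitGenLoader{gens: map[string]tenantGens{"tenant-a": {results: "4", store: "2"}}}
	fmt.Println(l.GetResultsCacheGenNumber([]string{"tenant-a"})) // "4"
	fmt.Println(l.GetStoreCacheGenNumber([]string{"tenant-a"}))   // "2"
}
```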

Contributor (author):

I'm still a little confused on this. Why not just invalidate all the caches whenever a delete happens? Is the idea to preserve the storage cache as long as possible?

Collaborator:

I too am confused. If we invalidate both, won't the results cache get updated anyway, since it would re-cache the results of queries that hit the actual store once the store cache is also invalidated? That seems like less complexity to me: just always invalidate both.

Contributor:

When a new delete request comes in, we do query-time filtering of the data until the actual data gets deleted from storage. So if we invalidate both the results and the store cache when a new delete request comes in, we would end up caching the same data again anyway, since we have not processed the delete request yet; dropping and re-caching identical data is a waste.

Similarly, by the time the delete request is processed, whatever results we have cached were already built with the deleted data filtered out, since we cleared the results cache when the delete request was received. So if we invalidate the results cache again after deleting data from the store, we would be wasting resources dropping and re-caching the same data, this time without query-time filtering but with the data actually deleted from storage.

Does that make sense? Please let me know if anything is still unclear.
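
A small illustrative sketch of that lifecycle (not Loki's code), tying each increment to the event described above; the comments trace why bumping both numbers on both events would only re-cache identical data.

```go
package main

import "fmt"

// Illustrative lifecycle of a delete request and the generation bumps it
// triggers, following the reasoning above. Not Loki's actual code.
type gens struct{ results, store int }

func (g *gens) onDeleteRequestCreated() {
	// Query-time filtering starts now, so previously cached (unfiltered)
	// results are stale and the results cache must be invalidated.
	g.results++
	// The data in object storage has not changed yet, so invalidating the
	// store cache here would only re-fetch and re-cache identical chunks.
}

func (g *gens) onDeleteRequestProcessed() {
	// The data in storage has actually changed now, so the store cache is stale.
	g.store++
	// The results cache was already rebuilt with filtering applied when the
	// request was created; invalidating it again would re-cache the same answers.
}

func main() {
	g := &gens{}
	fmt.Printf("start:     %+v\n", *g)
	g.onDeleteRequestCreated()
	fmt.Printf("created:   %+v\n", *g) // only the results generation moved
	g.onDeleteRequestProcessed()
	fmt.Printf("processed: %+v\n", *g) // only the store generation moved
}
```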

Contributor (author):

This makes sense. I'm going to implement it in a follow-on PR.

pkg/loki/loki.go (outdated)
	Ingester:                 {Store, Server, MemberlistKV, TenantConfigs, UsageReport},
	Querier:                  {Store, Ring, Server, IngesterQuerier, TenantConfigs, UsageReport},
-	QueryFrontendTripperware: {Server, Overrides, TenantConfigs},
+	QueryFrontendTripperware: {Server, Overrides, TenantConfigs, CacheGenNumberLoader},
Contributor:

I see a couple of complexities with initializing the cache gen number loader from the QF:

  • The QF will now depend on the index store for downloading the delete-requests DB and reading the cache gen number from it.
  • Deployment of the QF becomes more complex when custom mount paths are set in the YAML config for downloading boltdb files. In jsonnet the path is set to /data/boltdb-cache and the same YAML is shared by all the components. Since the QF does not have such a path, we would either have to change it for the QF or provision the same path so that the QF can write to it.

I was hoping to avoid adding a store dependency to the QF. The other option I can think of is the QF pulling results cache gen numbers from the Queriers via an API. It would keep gen numbers in memory only for the users whose queries it is processing, and it would keep pulling updates from the Queriers for all the gen numbers it holds in memory.

What do you think?
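
A rough sketch of what such a puller might look like; the endpoint path, types, and polling scheme are all assumptions made for illustration, not an existing Loki API.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// genNumberClient is a hypothetical client the query frontend could use to
// pull results cache gen numbers over an API instead of reading the store.
// The endpoint path below is invented for illustration.
type genNumberClient struct {
	mu       sync.RWMutex
	endpoint string            // base URL of a querier (or compactor)
	gens     map[string]string // tenant -> last known generation number
}

// GenNumberFor returns the in-memory generation for a tenant, fetching it on
// first use so that only actively queried tenants are tracked.
func (c *genNumberClient) GenNumberFor(tenant string) string {
	c.mu.RLock()
	g, ok := c.gens[tenant]
	c.mu.RUnlock()
	if ok {
		return g
	}
	return c.refresh(tenant)
}

func (c *genNumberClient) refresh(tenant string) string {
	resp, err := http.Get(c.endpoint + "/cache_gen_numbers/" + tenant) // hypothetical endpoint
	if err != nil {
		return "" // on error, leave any previously stored value untouched
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	c.mu.Lock()
	c.gens[tenant] = string(b)
	c.mu.Unlock()
	return string(b)
}

// poll periodically refreshes every tenant the frontend currently holds in memory.
func (c *genNumberClient) poll(interval time.Duration) {
	for range time.Tick(interval) {
		c.mu.RLock()
		tenants := make([]string, 0, len(c.gens))
		for t := range c.gens {
			tenants = append(tenants, t)
		}
		c.mu.RUnlock()
		for _, t := range tenants {
			c.refresh(t)
		}
	}
}

func main() {
	c := &genNumberClient{endpoint: "http://querier:3100", gens: map[string]string{}}
	go c.poll(time.Minute)
	fmt.Println(c.GenNumberFor("tenant-a"))
}
```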

Collaborator:

I agree that this could be a messy dependency, since there would now potentially be two components downloading chunks to store locally, increasing traffic to object storage, and I agree the idea of an API the QF can use to talk to the Querier is worth pursuing.

I am, however, having trouble following the path issue. I don't see a mount to /data for either QF or querier, so I think both are just using ephemeral disk inside the container, and it shouldn't matter?

Contributor (author):

Queriers already depend on Frontends, and I'm wary of introducing an API where frontends now depend on queriers. I think a better way might be to make a "shipper" specifically for deletes that can write to ephemeral disk; the only real difference would be the tables that are downloaded.

Here is a POC that seems to work, where the whole shipper writes to ephemeral disk: MasslessParticle#3

If we take this approach, making a delete-specific component could be a follow-on PR. Thoughts?
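
For illustration, a hedged sketch of what a delete-specific shipper could boil down to: the same kind of download loop, restricted to the delete-requests table and writing under an ephemeral per-process directory instead of a shared mount. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Hypothetical delete-only "shipper": like the full index shipper, but it
// only syncs the delete-requests table and writes to ephemeral disk.
const deleteRequestsTable = "delete_requests"

// shouldDownload keeps only the delete-requests table, which is the main
// difference from a shipper that syncs every index table.
func shouldDownload(table string) bool {
	return strings.HasPrefix(table, deleteRequestsTable)
}

func main() {
	// Ephemeral disk inside the container/process, so no /data/boltdb-cache
	// style mount needs to be provisioned for the query frontend.
	dir, err := os.MkdirTemp("", "delete-requests-shipper")
	if err != nil {
		panic(err)
	}
	for _, table := range []string{"index_19000", deleteRequestsTable} {
		if shouldDownload(table) {
			fmt.Println("would download", table, "to", filepath.Join(dir, table))
		}
	}
}
```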

Contributor:

Re the API to get the results cache gen: we could also make the QF talk to the Compactor instead of the Queriers.

Re downloading the delete-requests table in the QF: other than adding a dependency on the Store, my concern is whether we can cover all the failure modes, e.g. the QF running as a binary on bare metal without write access to a temp directory, though I am not sure that is even possible.

@trevorwhitney (Collaborator) left a comment

Looks great, just a couple reactions to @sandeepsukhani's comments.


@owen-d (Member) left a comment

This is looking pretty good, although I think it may be better served by the queriers or index gateways, which aren't singletons like the compactor and are thus less likely to all be unavailable at once.

@MasslessParticle (Contributor, author) commented

We discussed where the API should live at length and settled on the compactor because it's the source of truth on deletes and related data.

The impact of a down compactor is that the caches are stale for a while. The goal for deletes is that they happen on the order of hours in the worst case, so hopefully someone notices their compactor is down before then?

@owen-d (Member) left a comment

LGTM.

Looks like there will be a follow-up, as per the conversation with @sandeepsukhani, since this only covers the results cache at the moment.

@owen-d merged commit 9cef86b into grafana:main, Apr 21, 2022
MichelHollands added a commit that referenced this pull request Nov 7, 2022
Adds troubleshooting to generation cache (gennumber) errors. Ref:
#5661

Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
changhyuni pushed a commit to changhyuni/loki that referenced this pull request Nov 8, 2022
Adds troubleshooting to generation cache (gennumber) errors. Ref:
grafana#5661

Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
Abuelodelanada pushed a commit to canonical/loki that referenced this pull request Dec 1, 2022
Adds troubleshooting to generation cache (gennumber) errors. Ref:
grafana#5661

Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>