
Invalidate caches on delete #5661

Merged

Conversation

@MasslessParticle (Contributor) commented Mar 17, 2022

This PR does two things:

  1. Generates and stores a generation number for a tenant when deletes are added, processed, or removed.
  2. Creates an implementation of CacheGenNumLoader that is passed to our metrics Tripperware

As a result, when deletes are added or changed for a user, the caches are queried with an updated generation number on the keys, so the user sees filtered/unfiltered results as deletes are created and processed.
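For illustration only, here is a minimal sketch of the general idea (not Loki's actual code; all names are invented): the tenant's generation number is folded into every cache key, so bumping it makes old entries unreachable without explicitly deleting anything.

```go
package main

import "fmt"

// Invented names for illustration: the tenant's current generation number is
// folded into every cache key, so bumping the generation makes old entries
// unreachable without explicitly deleting them.
var genNumbers = map[string]string{"tenant-a": "3"}

// cacheKey builds the effective cache key for one tenant's query result.
func cacheKey(tenantID, baseKey string) string {
	return fmt.Sprintf("%s/%s/%s", genNumbers[tenantID], tenantID, baseKey)
}

func main() {
	fmt.Println(cacheKey("tenant-a", `sum(rate({app="foo"}[5m]))/2022-03-17`))
	// A delete being added/processed/removed bumps the generation, which
	// changes every key, so previously cached results are no longer hit.
	genNumbers["tenant-a"] = "4"
	fmt.Println(cacheKey("tenant-a", `sum(rate({app="foo"}[5m]))/2022-03-17`))
}
```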

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@MasslessParticle requested a review from a team as a code owner, March 17, 2022 15:36
@MichelHollands (Contributor) left a comment

LGTM. A few smaller comments.

pkg/loki/loki.go (outdated; resolved)
@sandeepsukhani (Contributor) left a comment

I have added some thoughts. Please let me know what you think about those.

Comment on lines 111 to 117
func (l *GenNumberLoader) GetResultsCacheGenNumber(tenantIDs []string) string {
	return l.getCacheGenNumbersPerTenants(tenantIDs)
}

func (l *GenNumberLoader) GetStoreCacheGenNumber(tenantIDs []string) string {
	return l.getCacheGenNumbersPerTenants(tenantIDs)
}
Contributor:
The store and results cache gen numbers must be different and they must be incremented at different times.

The store cache gen would be incremented when a delete request is marked as processed, because we are changing the data in storage and the old data could still be in the cache.

The results cache gen would be incremented when a delete request is created/cancelled, because we are going to start/stop doing query-time filtering for the affected streams.
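
For contrast with the snippet above, where both getters return the same shared value, here is a hedged sketch of what split per-tenant generations might look like; the types and fields are illustrative, not Loki's actual API.

```go
package main

import "fmt"

// Illustrative only: a loader that tracks the results-cache and store-cache
// generations separately per tenant, unlike the snippet above where both
// getters delegate to the same shared number.
type tenantGens struct {
	results string // changes when a delete request is created or cancelled
	store   string // changes when a delete request is marked as processed
}

type splitGenLoader struct {
	gens map[string]tenantGens
}

func (l *splitGenLoader) GetResultsCacheGenNumber(tenantIDs []string) string {
	return l.join(tenantIDs, func(g tenantGens) string { return g.results })
}

func (l *splitGenLoader) GetStoreCacheGenNumber(tenantIDs []string) string {
	return l.join(tenantIDs, func(g tenantGens) string { return g.store })
}

// join concatenates the chosen generation number of every tenant, so a change
// for any tenant involved in the query changes the resulting cache key.
func (l *splitGenLoader) join(tenantIDs []string, pick func(tenantGens) string) string {
	out := ""
	for i, id := range tenantIDs {
		if i > 0 {
			out += ","
		}
		out += pick(l.gens[id])
	}
	return out
}

func main() {
	l := &splitGenLoader{gens: map[string]tenantGens{"tenant-a": {results: "4", store: "2"}}}
	fmt.Println(l.GetResultsCacheGenNumber([]string{"tenant-a"})) // "4"
	fmt.Println(l.GetStoreCacheGenNumber([]string{"tenant-a"}))   // "2"
}
```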

Contributor (author):

I'm still a little confused on this. Why not just invalidate all the caches whenever a delete happens? Is the idea to preserve the storage cache as long as possible?

Collaborator:

I too am confused. If we invalidate both, won't the results cache get updated anyway, since it would re-cache the results of queries that hit the actual store once the store cache is also invalidated? That seems like less complexity to me: just always invalidate both.

Contributor:

When a new delete request comes in, we do query-time filtering of the data until the actual data gets deleted from storage. So if we invalidate both the results and the store cache when a new delete request comes in, we would end up caching the same data again anyway, since we have not processed the delete request yet; dropping and re-caching identical data is a waste.

Similarly, by the time the delete request is processed, whatever results we have cached were already built with the deleted data filtered out, since we cleared the results cache when the delete request was received. So if we invalidate the results cache again after deleting data from the store, we would be wasting resources dropping and re-caching the same data, this time without query-time filtering but with the data actually deleted from storage.

Does that make sense? Please let me know if anything is still unclear.
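
A small illustrative sketch of that lifecycle (not Loki's code), tying each increment to the event described above; the comments trace why bumping both numbers on both events would only re-cache identical data.

```go
package main

import "fmt"

// Illustrative lifecycle of a delete request and the generation bumps it
// triggers, following the reasoning above. Not Loki's actual code.
type gens struct{ results, store int }

func (g *gens) onDeleteRequestCreated() {
	// Query-time filtering starts now, so previously cached (unfiltered)
	// results are stale and the results cache must be invalidated.
	g.results++
	// The data in object storage has not changed yet, so invalidating the
	// store cache here would only re-fetch and re-cache identical chunks.
}

func (g *gens) onDeleteRequestProcessed() {
	// The data in storage has actually changed now, so the store cache is stale.
	g.store++
	// The results cache was already rebuilt with filtering applied when the
	// request was created; invalidating it again would re-cache the same answers.
}

func main() {
	g := &gens{}
	fmt.Printf("start:     %+v\n", *g)
	g.onDeleteRequestCreated()
	fmt.Printf("created:   %+v\n", *g) // only the results generation moved
	g.onDeleteRequestProcessed()
	fmt.Printf("processed: %+v\n", *g) // only the store generation moved
}
```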

Contributor (author):

This makes sense. I'm going to implement it in a follow-on PR.

pkg/loki/loki.go (outdated)
	Ingester:                 {Store, Server, MemberlistKV, TenantConfigs, UsageReport},
	Querier:                  {Store, Ring, Server, IngesterQuerier, TenantConfigs, UsageReport},
-	QueryFrontendTripperware: {Server, Overrides, TenantConfigs},
+	QueryFrontendTripperware: {Server, Overrides, TenantConfigs, CacheGenNumberLoader},
Contributor:

I see a couple of complexities with initializing the cache gen number loader from the QF:

  • The QF will now depend on the index store for downloading the delete-requests DB and reading the cache gen number from it.
  • Deployment of the QF becomes more complex when custom mount paths are set in the YAML config for downloading boltdb files. In jsonnet the path is set to /data/boltdb-cache and the same YAML is shared by all the components. Since the QF does not have such a path, we would either have to change it for the QF or provision the same path so that the QF can write to it.

I was hoping to avoid adding a store dependency to the QF. The other option I can think of is the QF pulling results cache gen numbers from the Queriers via an API. It would keep gen numbers in memory only for the users whose queries it is processing, and it would keep pulling updates from the Queriers for all the gen numbers it holds in memory.

What do you think?
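
A rough sketch of what such a puller might look like; the endpoint path, types, and polling scheme are all assumptions made for illustration, not an existing Loki API.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// genNumberClient is a hypothetical client the query frontend could use to
// pull results cache gen numbers over an API instead of reading the store.
// The endpoint path below is invented for illustration.
type genNumberClient struct {
	mu       sync.RWMutex
	endpoint string            // base URL of a querier (or compactor)
	gens     map[string]string // tenant -> last known generation number
}

// GenNumberFor returns the in-memory generation for a tenant, fetching it on
// first use so that only actively queried tenants are tracked.
func (c *genNumberClient) GenNumberFor(tenant string) string {
	c.mu.RLock()
	g, ok := c.gens[tenant]
	c.mu.RUnlock()
	if ok {
		return g
	}
	return c.refresh(tenant)
}

func (c *genNumberClient) refresh(tenant string) string {
	resp, err := http.Get(c.endpoint + "/cache_gen_numbers/" + tenant) // hypothetical endpoint
	if err != nil {
		return "" // on error, leave any previously stored value untouched
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	c.mu.Lock()
	c.gens[tenant] = string(b)
	c.mu.Unlock()
	return string(b)
}

// poll periodically refreshes every tenant the frontend currently holds in memory.
func (c *genNumberClient) poll(interval time.Duration) {
	for range time.Tick(interval) {
		c.mu.RLock()
		tenants := make([]string, 0, len(c.gens))
		for t := range c.gens {
			tenants = append(tenants, t)
		}
		c.mu.RUnlock()
		for _, t := range tenants {
			c.refresh(t)
		}
	}
}

func main() {
	c := &genNumberClient{endpoint: "http://querier:3100", gens: map[string]string{}}
	go c.poll(time.Minute)
	fmt.Println(c.GenNumberFor("tenant-a"))
}
```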

Collaborator:

I agree that this could be a messy dependency, since there would now potentially be two components downloading chunks to store locally, increasing traffic to object storage, and I agree the idea of an API the QF can use to talk to the Querier is worth pursuing.

I am, however, having trouble following the path issue. I don't see a mount to /data for either QF or querier, so I think both are just using ephemeral disk inside the container, and it shouldn't matter?

Contributor (author):

Queriers already depend on Frontends, and I'm wary of introducing an API where frontends now depend on queriers. I think a better way might be to make a "shipper" specifically for deletes that can write to ephemeral disk; the only real difference would be the tables that are downloaded.

Here is a POC that seems to work, where the whole shipper writes to ephemeral disk: MasslessParticle#3

If we take this approach, making a delete-specific component could be a follow-on PR. Thoughts?
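
For illustration, a hedged sketch of what a delete-specific shipper could boil down to: the same kind of download loop, restricted to the delete-requests table and writing under an ephemeral per-process directory instead of a shared mount. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Hypothetical delete-only "shipper": like the full index shipper, but it
// only syncs the delete-requests table and writes to ephemeral disk.
const deleteRequestsTable = "delete_requests"

// shouldDownload keeps only the delete-requests table, which is the main
// difference from a shipper that syncs every index table.
func shouldDownload(table string) bool {
	return strings.HasPrefix(table, deleteRequestsTable)
}

func main() {
	// Ephemeral disk inside the container/process, so no /data/boltdb-cache
	// style mount needs to be provisioned for the query frontend.
	dir, err := os.MkdirTemp("", "delete-requests-shipper")
	if err != nil {
		panic(err)
	}
	for _, table := range []string{"index_19000", deleteRequestsTable} {
		if shouldDownload(table) {
			fmt.Println("would download", table, "to", filepath.Join(dir, table))
		}
	}
}
```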

Contributor:

Re the API to get the results cache gen: we could also make the QF talk to the Compactor instead of the Queriers.

Re downloading the delete-requests table in the QF: other than adding a dependency on the Store, my concern is whether we can cover all the failure modes, e.g. the QF running as a binary on bare metal without write access to a temp directory, though I am not sure that is even possible.

@trevorwhitney (Collaborator) left a comment

Looks great, just a couple reactions to @sandeepsukhani's comments.


@owen-d (Member) left a comment

This is looking pretty good, although I think it may be better served by the queriers or index gateways, which aren't singletons like the compactor and are thus less likely to all be unavailable at once.

@MasslessParticle (Contributor, author) commented

We discussed where the API should live at length and settled on the compactor because it's the source of truth on deletes and related data.

The impact of a down compactor is that the caches are stale for a while. The goal for deletes is that they happen on the order of hours in the worst case, so hopefully someone notices their compactor is down before then?

@owen-d (Member) left a comment

LGTM.

Looks like there will be a follow-up, as per the conversation with @sandeepsukhani, since this only covers the results cache at the moment.

@owen-d merged commit 9cef86b into grafana:main, Apr 21, 2022
MichelHollands added a commit that referenced this pull request Nov 7, 2022
Adds troubleshooting to generation cache (gennumber) errors. Ref:
#5661

Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
changhyuni pushed a commit to changhyuni/loki that referenced this pull request Nov 8, 2022
Adds troubleshooting to generation cache (gennumber) errors. Ref:
grafana#5661

Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
Abuelodelanada pushed a commit to canonical/loki that referenced this pull request Dec 1, 2022
Adds troubleshooting to generation cache (gennumber) errors. Ref:
grafana#5661

Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>