set gc grace to zero #3513
Conversation
Do we actually need this/benefit from this being greater than zero? It'd be real nice for short lived things to never end up in SSTables. As I understand it, read/write quorum and delete consistency all should save us from issues.
Note that this will also set the hint TTL to 0 (effectively disabling hinted handoff).
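For reference, gc_grace is a per-table schema option, and (per the Last Pickle post linked below) Cassandra caps the TTL of stored hints at the target table's gc_grace_seconds, which is why setting it to 0 effectively disables hinted handoff for that table. A sketch of the change being discussed (keyspace and table names here are illustrative, not from this PR):

```cql
-- Illustrative only: gc_grace_seconds is set per table.
-- Hints for writes to this table are stored with a TTL capped at its
-- gc_grace_seconds, so 0 effectively disables hinted handoff for it.
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 0;
```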
My only worry is around quorum writes that didn't create 3 copies / that hints were kind of nice because we don't have any maintenance-style repair currently.
For a nice explanation of what Ashray and Clocks are getting at: http://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html In practice, now that we're issuing range tombstones in targeted sweep, I don't think we really have to worry about tombstone accumulation the same way we used to. Of two overlapping range tombstones where one is both larger and newer than the other, the smaller/older one will get removed regardless of gc_grace, and I'd expect the TS write pattern to always result in this. So in very high overwrite tables you'd generally expect this to occur in the memtable layer and then persist just a single tombstone on every flush, which will then be compacted relatively quickly. Would be good to actually observe what I'm describing in practice, but yea, I don't think we need to sacrifice the hints here for what I understand as little to no benefit.
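The shadowing rule described above can be sketched with a toy model (this is illustrative Python, not Cassandra's actual compaction code; the names are made up): a range tombstone that is entirely covered by a newer, wider one is droppable at compaction time regardless of gc_grace.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RangeTombstone:
    start: int       # inclusive start of the deleted clustering range
    end: int         # inclusive end of the deleted clustering range
    timestamp: int   # write timestamp

def shadowed(older: RangeTombstone, newer: RangeTombstone) -> bool:
    """True if `older` is entirely covered by a strictly newer tombstone."""
    return (newer.timestamp > older.timestamp
            and newer.start <= older.start
            and newer.end >= older.end)

def compactable(tombstones):
    """Keep only tombstones not shadowed by any other in the set."""
    return [t for t in tombstones
            if not any(shadowed(t, other) for other in tombstones
                       if other is not t)]

# Two sweeps over overlapping ranges: the second is wider and newer,
# so only it needs to survive compaction.
first = RangeTombstone(start=0, end=100, timestamp=1)
second = RangeTombstone(start=0, end=200, timestamp=2)
print(compactable([first, second]))  # only `second` remains
```

This is the write pattern the comment expects targeted sweep to produce: each pass issues a range tombstone covering (at least) the previous one, so older tombstones never pile up waiting out gc_grace.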
As I understand it, if you do quorum reads/writes and delete at 'ALL', all hints ever buy you is perf. But they also hurt stability, so there's a bit of a tradeoff, right? The workflow I know exists is one where e.g. you might have a table of active jobs, and you have maybe 100 jobs active at a time, with each of them taking maybe 60 seconds, and being swept after a few minutes. With GC grace 1 hour, your range scan to select all of them is going to be reading 6000 cells of which 5900 are tombstones. With GC grace 0, you plausibly never really end up with data on disk, right? That's the workflow I was hoping to enable here.
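A back-of-the-envelope check of those numbers, under the workload assumed above (~100 jobs live at any instant, each lasting ~60 seconds, so ~100 completions per minute):

```python
active_jobs = 100        # live rows at any instant
jobs_per_minute = 100    # each job lasts ~1 minute, so ~100 complete per minute

gc_grace_minutes = 60    # gc_grace_seconds = 3600

# Every row written in the last gc_grace window is still on disk,
# but only the currently active jobs are live; the rest are tombstones.
cells_scanned = jobs_per_minute * gc_grace_minutes
tombstones = cells_scanned - active_jobs

print(cells_scanned, tombstones)  # 6000 cells scanned, 5900 of them tombstones
```

With gc_grace 0 the tombstones are purgeable as soon as they're flushed and compacted, so the scan reads roughly just the ~100 live rows.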
I agree with you about the tombstones; I'm just worried about this disabling hints. A specific example of what I'm worried about is a user with a completely immutable, append-only style table access pattern, where data is written once and rarely if ever read, and there's no reason to sweep anything since it's all live. If a write to such a table fails to achieve full RF3 replication, today at least there's a good chance it'll eventually get fixed via the very targeted repair of hinted handoff, when the bad node starts behaving better and takes delivery of hints from other nodes. But otherwise the only mechanism by which that repair would happen today is happening to hit the 10% read-repair-chance lottery while happening to read the poorly-replicated data in a query, and having Cassandra kick off her own targeted repair via the read-repair mechanism. Fixes to that situation I think are either of:
What's the resolution here? Do we want to track having automatic repair infra before considering this? |
We caught up offline, since James, Tom, and I were already in a meeting together. We came to the conclusion that this commit is fine, but that we should be forcing repairs at least before some cluster-expansion operations that currently do not force a repair in our scripts.