set gc grace to zero #3513
Conversation
Do we actually need this/benefit from this being greater than zero? It'd be real nice for short lived things to never end up in SSTables. As I understand it, read/write quorum and delete consistency all should save us from issues.
Note that this will also set the hint TTL to 0 (effectively disabling hinted handoff).
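For reference, gc_grace is a per-table schema option, and (per the Last Pickle post linked below) Cassandra caps the TTL of stored hints at the target table's gc_grace_seconds, which is why setting it to 0 effectively disables hinted handoff for that table. A sketch of the change being discussed (keyspace and table names here are illustrative, not from this PR):

```cql
-- Illustrative only: gc_grace_seconds is set per table.
-- Hints for writes to this table are stored with a TTL capped at its
-- gc_grace_seconds, so 0 effectively disables hinted handoff for it.
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 0;
```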
My only worry is around quorum writes that didn't create 3 copies / that hints were kind of nice because we don't have any maintenance-style repair currently.
For a nice explanation of what Ashray and Clocks are getting at: http://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html In practice, now that we're issuing range tombstones in targeted sweep, I don't think we really have to worry about tombstone accumulation the same way we used to. Of two overlapping range tombstones where one is both larger and newer than the other, the smaller/older one will get removed regardless of gc_grace, and I'd expect the TS write pattern to always result in this. So in very high overwrite tables you'd generally expect this to occur in the memtable layer and then persist just a single tombstone on every flush, which will then be compacted relatively quickly. Would be good to actually observe what I'm describing in practice, but yea, I don't think we need to sacrifice the hints here for what I understand as little to no benefit.
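The shadowing rule described above can be sketched with a toy model (this is illustrative Python, not Cassandra's actual compaction code; the names are made up): a range tombstone that is entirely covered by a newer, wider one is droppable at compaction time regardless of gc_grace.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RangeTombstone:
    start: int       # inclusive start of the deleted clustering range
    end: int         # inclusive end of the deleted clustering range
    timestamp: int   # write timestamp

def shadowed(older: RangeTombstone, newer: RangeTombstone) -> bool:
    """True if `older` is entirely covered by a strictly newer tombstone."""
    return (newer.timestamp > older.timestamp
            and newer.start <= older.start
            and newer.end >= older.end)

def compactable(tombstones):
    """Keep only tombstones not shadowed by any other in the set."""
    return [t for t in tombstones
            if not any(shadowed(t, other) for other in tombstones
                       if other is not t)]

# Two sweeps over overlapping ranges: the second is wider and newer,
# so only it needs to survive compaction.
first = RangeTombstone(start=0, end=100, timestamp=1)
second = RangeTombstone(start=0, end=200, timestamp=2)
print(compactable([first, second]))  # only `second` remains
```

This is the write pattern the comment expects targeted sweep to produce: each pass issues a range tombstone covering (at least) the previous one, so older tombstones never pile up waiting out gc_grace.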
As I understand it, if you do quorum reads/writes and delete at 'ALL', all hints ever buy you is perf. But they also hurt stability, so there's a bit of a tradeoff, right? The workflow I know exists is one where e.g. you might have a table of active jobs, and you have maybe 100 jobs active at a time, with each of them taking maybe 60 seconds, and being swept after a few minutes. With GC grace 1 hour, your range scan to select all of them is going to be reading 6000 cells of which 5900 are tombstones. With GC grace 0, you plausibly never really end up with data on disk, right? That's the workflow I was hoping to enable here.
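A back-of-the-envelope check of those numbers, under the workload assumed above (~100 jobs live at any instant, each lasting ~60 seconds, so ~100 completions per minute):

```python
active_jobs = 100        # live rows at any instant
jobs_per_minute = 100    # each job lasts ~1 minute, so ~100 complete per minute

gc_grace_minutes = 60    # gc_grace_seconds = 3600

# Every row written in the last gc_grace window is still on disk,
# but only the currently active jobs are live; the rest are tombstones.
cells_scanned = jobs_per_minute * gc_grace_minutes
tombstones = cells_scanned - active_jobs

print(cells_scanned, tombstones)  # 6000 cells scanned, 5900 of them tombstones
```

With gc_grace 0 the tombstones are purgeable as soon as they're flushed and compacted, so the scan reads roughly just the ~100 live rows.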
I agree with you about the tombstones; I'm just worried about this disabling hints. A specific example of what I'm worried about is a user with a completely immutable, append-only style table access pattern, where data is written once and rarely if ever read, and there's no reason to sweep anything since it's all live. If a write to such a table fails to achieve full RF3 replication, today at least there's a good chance it'll eventually get fixed via the very targeted repair of hinted handoff, when the bad node starts behaving better and takes delivery of hints from other nodes. But otherwise the only mechanism by which that repair would happen today is happening to hit the 10% read-repair-chance lottery while happening to read the poorly-replicated data in a query, and having Cassandra kick off her own targeted repair via the read-repair mechanism. Fixes to that situation I think are either of:
What's the resolution here? Do we want to track having automatic repair infra before considering this? |
We caught up offline, since James, Tom, and I were already in a meeting together. We came to the conclusion that this commit is fine, but that we should be forcing repairs at least before some cluster-expansion operations that currently do not force a repair in our scripts.