
Discussion of weaker failure modes #3

Open
pbailis opened this issue May 20, 2013 · 4 comments

pbailis commented May 20, 2013

I think there are two major causes of confusion when it comes to "beating CAP." Number one is the fault model that Gilbert and Lynch assume. (Number two is the confusion over "application-level" versus "CAP" consistency, as in #4.)

I've recently seen several discussions of CAP (even in academic publications) that weaken the availability requirement. Neither of the following is actually Gilbert and Lynch HA, but each guarantees a response within its relaxed failure domain. Here are two examples:

  • "up to F server faults": If you can contact a majority of servers, you can get a response. Not "HA" as minority servers may be partitioned. The HyperDex paper, Section 8 states this assumption rather clearly, noting it is "thus able to provide seemingly impossible guarantees."
  • "for specific fault model[s]": If we provision networks appropriately, and partitions never happen, there many be no partitions! The Windows Azure Storage paper, Section 8 discusses this. It's stronger than the asynchronous model and is not HA (I'll not speculate as to how realistic this is, but the paper is fairly adamant that the system circumvents CAP.)

I'm not quite sure how to best address these in the text, but it might be useful.

Two concrete suggestions:

Under "15. Is a failed machine the same as a partitioned one?" the FAQ could mention that, in an HA system, a minority partitioned server still needs to guarantee a response.
Under "12. Is my network really asynchronous?" the FAQ could mention that, in the limit, failures can render any communication network asynchronous.

Alternatively, (at the risk of starting a "list of shame"), the FAQ might expand "17. Have I 'got around' or 'beaten' the CAP theorem?" into a "list of common fallacies" like those above.

I'm curious what you think and am happy to drop a pull request if there's interest.


henryr commented May 20, 2013

wrt "up to F server faults": I agree, there's abuse of this model all over the place (including in a bunch of things I wrote).

There's a trivial analogue of CAP for this failure mode that says there's always a non-total failure mode that causes you to give up either consistency or availability of the non-failed machines in an asynchronous network. I think this is true, and scribbled down a proof once. It's this, rather than network partitions, that I think leads more directly to quorum models and the tradeoff between latency and availability that Daniel Abadi wrote about.

I think your suggestions are good: I included "15. Is a failed machine the same as a partitioned one?" deliberately to address the difference between the two failure models, and making this more explicit is a good thing.

What do you mean by "in the limit, failures can render..."? A crash-stop failure means messages are never received, rather than eventually received after an arbitrary delay.


pbailis commented May 20, 2013

"There's a trivial analogue of CAP for this failure mode that says there's always a non-total failure mode that causes you to give up either consistency or availability of the non-failed machines in an asynchronous network."

I think I'd agree; the idea of a (series of) non-total failure(s) that still leads to unavailability reminds me of the bivalency argument from FLP. However, I've never found bivalency to be as intuitive as, say, the ability to arbitrarily place and schedule partitions.

"...the tradeoff between latency and availability that Daniel Abadi wrote about."

The notion of HA as a proxy for "coordination complexity" is really exciting. I haven't made an issue for it yet, but I think the latency-consistency trade-off he's highlighted is terribly important.

"What do you mean by 'in the limit, failures can render...'? A crash-stop failure means messages are never received, rather than eventually received after an arbitrary delay."

Ah, sorry. I was trying to say something much more mundane. I meant to say "even if you build fifty redundant networks, in the extreme, you'll still be unavailable when all fifty fail"; this is related to your discussion in (10): your network is probably asynchronous. Hmm. Perhaps it's best to simply beef up (12) to say something like: "A synchronous network must guarantee delivery of 100% of messages within a finite amount of time; if any message can be delayed indefinitely, the network is asynchronous." Writing this up, I'm realizing how difficult it is to map these models to real-world networks (more kudos on your work so far!), but I think this is where the confusion lies. Perhaps giving a few examples would be useful (e.g., "what if I have redundant networks?"). But, that said, maybe this is splitting hairs.
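The redundant-networks point can be made concrete with a back-of-the-envelope sketch. The failure probability `p` below is an arbitrary assumed value, not a measurement:

```python
# Redundancy lowers, but never eliminates, the chance that every path
# fails at once. With 50 independent links that each fail with
# probability p, the chance of total disconnection is p**50 -- tiny,
# but nonzero, so bounded delivery cannot be *guaranteed* and the
# network is formally asynchronous.
p = 0.01      # assumed per-link failure probability (illustrative)
links = 50
p_total_failure = p ** links

assert p_total_failure > 0  # nonzero: no synchronous bound can be guaranteed
```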

@johnwcowan

In that sense there are no synchronous networks: when the Earth explodes, any messages in transit will be lost forever.


pbailis commented Aug 20, 2013

One can only hope that the development and deployment of inter-planetary distributed databases will predate the Earth's explosion!

But, more seriously, while synchronous networks are indeed unrealistic in many cases, they allow for a range of "stronger" semantics/mechanisms than asynchronous networks (e.g., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.5990&rep=rep1&type=pdf).

I think these synchronous networks have use beyond distributed systems theory in that they allow us to consider systems designs under a much less rigorous set of failure scenarios. If your networks don't partition, you can achieve a lot more than if they do. The real question is what happens when the failure scenarios that you assumed weren't going to occur actually do occur--the correct answer is usually unavailability but can be worse in practice if these corner cases haven't been accounted for in the protocol.
