Allow multiple nodes to cooperate #159

nsoft · 2020-07-15T15:43:16Z

This is a placeholder/parent ticket for the key feature of 2.0 when we get there. This will generally include:

Cluster formation so that nodes can access a unified Cassandra cluster.
A means to pass documents among nodes (JavaSpace, Cassandra or otherwise)
A means for newly started nodes to detect existing nodes and join
A means for nodes to leave gracefully
Handling of ungraceful node loss
Loading and unloading of Plans without stopping the cluster.

nsoft · 2020-07-15T15:47:26Z

Creating this ticket so I can note one difficulty we will face with Cassandra in the first of those. Here's a conversation from the ASF Cassandra Slack:

Ztyx 8:11 AM
Hello! We have an application that executed a CREATE TABLE IF NOT EXIST ... on boot. A couple of months ago we hit a node schema disagreement (and the table already existed) and our suspicion was that it had to do with that query. Anyone else hit this?

Jeff Jirsa 8:22 AM
Strictly not safe in current versions of cassandra to have multiple processes execute that command at the same time
8:23
It is, unfortunately, something that’s known, poorly documented, and has horrible horrible side effects, including potential data loss months later when you restart the instance
8:24
@ztyx if you must have the app make tables, use external locking - like zookeeper or something

gus 8:49 AM
@jeff Jirsa is this only a problem when the table didn't exist and 2 start up or is there a potential problem regardless of whether the table exists?
8:54
Is this it: https://issues.apache.org/jira/browse/CASSANDRA-15844 ?

ASF JIRA Bridge (app) 8:54 AM
CASSANDRA-15844: Create table Asynchronously or creating table contact the same node from many client threads at same time may causing data loss

Jeff Jirsa 9:32 AM
The failure modes I know about involve diverging cfid so id expect it to be mostly around create
9:33
Wouldn’t be surprised if alter statements also cause problems, but it’d be like migration task storms and GC pressure not data loss
9:34
15844 describes one shape of what I mentioned is possible yes
9:35
The race can result in like a dozen different states (different permutations of the race). One involves the cfid in schema table not matching the cfid in the table path on disk, that’s the one where If you bounce you end up losing that data because cassandra makes the “right” empty data directory on startup
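
For reference, the "external locking" advice above could look roughly like the sketch below. This is only an illustration under assumptions, not a design decision: `SchemaLock`, `InProcessSchemaLock`, `DdlExecutor` and `SchemaBootstrap` are all hypothetical names. A real deployment would back `SchemaLock` with something that works across processes (ZooKeeper or similar, as suggested in the conversation); the `ReentrantLock` version here only serializes threads inside one JVM and is included so the sketch runs on its own.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical abstraction over an external (cross-process) lock. */
interface SchemaLock {
    boolean tryAcquire(long timeout, TimeUnit unit) throws InterruptedException;
    void release();
}

/** Stand-in for local testing only: guards threads in ONE JVM, not separate nodes. */
final class InProcessSchemaLock implements SchemaLock {
    private final ReentrantLock lock = new ReentrantLock();

    @Override
    public boolean tryAcquire(long timeout, TimeUnit unit) throws InterruptedException {
        return lock.tryLock(timeout, unit);
    }

    @Override
    public void release() {
        lock.unlock();
    }
}

/** Hypothetical callback that runs one DDL statement, e.g. through a driver session. */
interface DdlExecutor {
    void execute(String cql);
}

/** Runs schema DDL only while holding the lock, so no two holders issue CREATE at once. */
final class SchemaBootstrap {
    private final SchemaLock lock;
    private final DdlExecutor executor;

    SchemaBootstrap(SchemaLock lock, DdlExecutor executor) {
        this.lock = lock;
        this.executor = executor;
    }

    /** Returns true if the DDL was executed, false if the lock could not be obtained in time. */
    boolean createSchema(List<String> ddl, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!lock.tryAcquire(timeout, unit)) {
            return false; // someone else holds the lock; do not issue DDL
        }
        try {
            for (String cql : ddl) {
                executor.execute(cql);
            }
            return true;
        } finally {
            lock.release(); // always release, even if a statement throws
        }
    }
}
```

A node that gets `false` back would presumably have to wait until the schema is visible before proceeding rather than retrying the CREATE itself; how to detect that is left open here.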

nsoft added this to the 2.0 milestone Jul 15, 2020

nsoft added the enhancement label Jul 15, 2020
