
Cassandra on Kubernetes - Seed Provider and Snitch #24286

Closed
chrislovecnm opened this Issue Apr 14, 2016 · 30 comments

@chrislovecnm
Member

chrislovecnm commented Apr 14, 2016

Background

This project currently contains a custom SeedProvider that allows for seed discovery within a Kubernetes cluster. This issue covers design and improvements around this initiative. Once these components are stable, the recommendation is to move this code into the Apache Cassandra code base. But since the code lives here ... this is where we start.

Definitions

Seed Provider / Gossip

https://docs.datastax.com/en/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html?scroll=configCassandra_yaml__seed_provider

When a node first starts up, it looks at its cassandra.yaml configuration file to determine the name of the Cassandra cluster it belongs to; which nodes (called seeds) to contact to obtain information about the other nodes in the cluster; and other parameters for determining port and range information.

seed_provider
The addresses of hosts deemed contact points. Cassandra nodes use the -seeds list to find each other and learn the topology of the ring.

  • class_name (Default: org.apache.cassandra.locator.SimpleSeedProvider)
    The class within Cassandra that handles the seed logic. It can be customized, but this is typically not required.

  • seeds (Default: 127.0.0.1)
    A comma-delimited list of IP addresses used by gossip for bootstrapping new nodes joining a cluster. When running multiple nodes, you must change the list from the default value. In multiple data-center clusters, it is a good idea to include at least one node from each data center (replication group) in the seed list. Designating more than a single seed node per data center is recommended for fault tolerance. Otherwise, gossip has to communicate with another data center when bootstrapping a node. Making every node a seed node is not recommended because of increased maintenance and reduced gossip performance. Gossip optimization is not critical, but it is recommended to use a small seed list (approximately three nodes per data center).
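
For concreteness, here is a minimal sketch of how the settings quoted above appear in cassandra.yaml; the IP addresses are placeholders, not values from this project:

```yaml
# cassandra.yaml (sketch): the default SimpleSeedProvider with an explicit seed list.
# Placeholder IPs; in a multi-DC cluster, include a couple of seeds per data center.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # Comma-delimited list used by gossip when a new node bootstraps.
      - seeds: "10.0.0.1,10.0.1.1,10.1.0.1"
```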

Snitches - https://docs.datastax.com/en/cassandra/3.x/cassandra/architecture/archSnitchesAbout.html

A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places the replicas based on the information provided by the snitch. All nodes must return the same rack and data center. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical location).
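
As an illustration (not part of the original issue text), the snitch is chosen in cassandra.yaml; the sketch below assumes GossipingPropertyFileSnitch, which reads each node's data center and rack from cassandra-rackdc.properties:

```yaml
# cassandra.yaml (sketch): pick a snitch that reports DC/rack topology.
# With GossipingPropertyFileSnitch, each node's own DC and rack come from
# cassandra-rackdc.properties, e.g.:
#   dc=DC1
#   rack=RAC1
endpoint_snitch: GossipingPropertyFileSnitch
```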

Tasks

  • Document Current State - Document how the current provider operates and how it is configured
  • Assess Architecture - Determine whether any short-term changes are recommended
  • PetSet - Determine how the architecture fits the future needs of the PetSet initiative
  • Write code and drink much Scotch or Green Tea.
  • Determine what is needed to support racks and multiple data center implementations.

Current Challenges

  • Documentation is lacking
  • The current seed provider sets every C* node as a seed ... not the best
  • The code is not well tested
  • The code is not built against the correct Cassandra version
  • No multi-rack support
  • No multi-DC support

@chrislovecnm


Member

chrislovecnm commented Apr 15, 2016

From @bprashanth .... what do you think?
@bgrant0607 - thoughts?

So from a quick look at the seed provider, I have a couple of questions. Looks like we're using endpoint ips https://github.com/chrislovecnm/kubernetes/blob/cassandra-seed-provider-docs/examples/cassandra/java/src/io/k8s/cassandra/KubernetesSeedProvider.java#L149

That is correct!

What happens if the Service just doesn't have endpoints populated yet? (Most people put the service def in the same file as the RC used for cassandra and kubectl create -f, so if the cassandra pods start running before the endpoints are populated, all of them will set themselves as the seed, right?)

It looks like the provider will drop back to the cassandra.yaml config:

- seeds: "110.82.155.0,110.82.155.3"

Also C* defaults to 127.0.0.1. More testing needs to occur.
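
For reference, here is a hedged sketch of how the custom provider and its fallback list are wired into cassandra.yaml; the class name follows the linked source, and the fallback value is illustrative:

```yaml
# cassandra.yaml (sketch): custom Kubernetes seed provider with a fallback seed list.
seed_provider:
  - class_name: io.k8s.cassandra.KubernetesSeedProvider
    parameters:
      # Used only if the Kubernetes endpoints lookup returns nothing.
      - seeds: "127.0.0.1"
```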

What happens if the IP of your seed goes down and gets re-assigned to some other container (either a cassandra node in your cluster, another cassandra cluster, or an unrelated nginx pod), and we're bringing up a node that still observes the old list of endpoints (so it gets the wrong IP)?

This provider needs to be tweaked to provide only two (or some other small number of) active seeds.

Using the global endpoints list and IPs doesn't sound reliable. I'd rather have each c* pod update, e.g., a config map with its own hostname (which it gets by running hostname), and take the first member of the list in the config map as the seed.

So a couple of things:

  • We need 2-3 C* seed nodes for redundancy
  • Picking the same nodes all the time would be helpful. Is the Endpoints endpoints = mapper.readValue(conn.getInputStream(), Endpoints.class); sorted?
  • How would we configure the config map before the cassandra nodes are up? I am thinking we have a causality dilemma.

If we only start one at a time, it's a guaranteed single writer to that config map. Changes to the configmap are reflected in all pods that mount it. That hostname will never get re-assigned even when the pod gets rescheduled.

We may be saying the same thing. But a use case is this:

  1. We deploy six C* nodes in a cluster.
  2. First node uses 127.0.0.1 as the seed
  3. Second node uses the first node's IP and/or 127.0.0.1 as seeds.
  4. Some number x of C* nodes need to be seeds; 2 or 3 seeds are recommended.
  5. If one of the C* seeds is lost, the provider needs to respond with two valid C* node IP addresses.
  6. nodetool decommission has to be run to remove the dead node.

Doing that won't be easy without petset, but just throwing it out there to get your thoughts.

I think we are ok even without PetSet

@bprashanth


Member

bprashanth commented Apr 15, 2016

so say you want 2 seeds.

The first node comes up, updates the config map with the output of hostname (no endpoint watching or IPs).
config map=[c1.default.cluster.local]
config file on disk for c1: seeds [c1]

Second node comes up, gets the config map, sees it's the second node and appends to the list:
config map=[c1, c2]
config file on disk for c1: [c1, c2]
config file on disk for c2: [c1, c2]

Each subsequent node comes up, sees there are already 2 nodes in the map, and just sets those as seeds.
The important part here is (I think):

  1. Single writer to config map, so there's no race for seed position
  2. c1 and c2 are hostnames, and don't ever get reassigned to pods from another c* cluster
  3. If c1 or c2 go down, they will always come back up with the same volume (that's not to say there isn't a partition scenario where we wouldn't need a babysitter process, but let's go for the simple case first)

How would we configure the config map before the cassandra nodes are up? I am thinking we have a causality dilemma.

"Init containers" or "pre-start" hooks is a work in progress, but even today, you can create an entrypoint script that first updats config map and then starts the c* daemon.

I think we are ok even without PetSet

Of course it will work. Without petset you need to create one service per node to get persistent IPs/hostnames, and you can't guarantee a single writer to any shared state because the RC will start a new pod ASAP (i.e., while the old one is still running, so in your one-Service-per-c*-pod example you will have 2 c* pods behind the service for some span of time). Also, there will always be a propagation delay between when an IP gets re-assigned to some other pod in your cluster and when the endpoints list is updated. It's possible to build a seed provider that works around these issues, but I'm not sure it's worth it.
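
For the non-petset route mentioned above, one Service per Cassandra pod would look roughly like the sketch below (the name, label, and ports are chosen here for illustration); the caveat about two pods briefly sitting behind the same selector still applies.

```yaml
# Hypothetical one-Service-per-node setup (sketch) to give a single Cassandra
# pod a stable name/IP when PetSet is not used.
apiVersion: v1
kind: Service
metadata:
  name: cassandra-seed-1          # one such Service per seed pod
spec:
  selector:
    app: cassandra
    seed-slot: "1"                # label assumed to be set on exactly one pod
  ports:
    - name: cql
      port: 9042
    - name: intra-node
      port: 7000
```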

@chrislovecnm


Member

chrislovecnm commented Apr 15, 2016

Also opened https://issues.apache.org/jira/browse/CASSANDRA-11585 requesting assistance, and contacted a couple of gurus.

@chrislovecnm


Member

chrislovecnm commented Apr 15, 2016

PR is in with initial documentation for how the Seed Provider works: #24296

@chrislovecnm


Member

chrislovecnm commented Apr 28, 2016

PR is in updating example #24945

@chrislovecnm


Member

chrislovecnm commented Apr 28, 2016

PR is in refactoring SeedProvider #24945

@shashiranjan84


shashiranjan84 commented Aug 26, 2016

@chrislovecnm, is there a way to control the version of Cassandra I want to use? My application (Kong) currently only supports Cassandra 2.2, and I tried using gcr.io/google-samples/cassandra:v8 but it's failing with:
(screenshot of the startup error attached)

gcr.io/google-samples/cassandra:v9 works fine but it comes with Cassandra 3.5.

Thanks
Shashi

@chrislovecnm


Member

chrislovecnm commented Aug 26, 2016

Which version of k8s are you using? What version of Cassandra do you need?

@shashiranjan84


shashiranjan84 commented Aug 27, 2016

Looking for Cassandra v 2.2.x

@chrislovecnm


Member

chrislovecnm commented Aug 27, 2016

I think the problem you are having is with the seed provider; you need to use a version that is compiled against C* 2.2 binaries. I have not checked whether we released such a version.

The other option, which is in alpha with 1.3, is pet set. Here is my PR: #30577 (comment)

@chrislovecnm


Member

chrislovecnm commented Aug 27, 2016

To be clear, if you use my pet set example you don't need the seed provider; you will, however, need to build your own Docker image, and that image will need to use the correct cassandra.yaml.
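
To illustrate that point: with pet set's stable DNS names, the stock SimpleSeedProvider can simply point at the first pet. Below is a hedged sketch of the relevant cassandra.yaml fragment, assuming a pet set and headless service both named cassandra in the default namespace (names are assumptions, not taken from the PR):

```yaml
# cassandra.yaml (sketch): plain SimpleSeedProvider using a pet's stable DNS name.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "cassandra-0.cassandra.default.svc.cluster.local"
```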

@shashiranjan84


shashiranjan84 commented Aug 29, 2016

ok thanks @chrislovecnm, I will give it a try.

@bgrant0607 bgrant0607 removed the help-wanted label Aug 30, 2016

@chrislovecnm


Member

chrislovecnm commented Sep 17, 2016

@bgrant0607 #30577 is merged, closing.

@GordonJiang


GordonJiang commented Oct 7, 2016

@chrislovecnm, I see this thread was closed 20 days back. Is it resolved? Are we able to deploy Cassandra in a PetSet for multiple data centers in Kubernetes? The key is that Cassandra nodes in different data centers need to talk to each other for data replication. How can we achieve that in Kubernetes?

@chrislovecnm


Member

chrislovecnm commented Oct 8, 2016

We are able to deploy multiple DCs inside the same cluster. Theoretically, with the correct routing, you should be able to route cross-cluster, but I have not done that; I have not had time to try 2x C* DCs in 2x K8s clusters. We should open another issue about the routing.

@blak3mill3r


blak3mill3r commented Oct 8, 2016

I've done something similar on AWS with IPsec for Kafka replication. If you make a virtual bridge across regions between 2 VPC subnets, the nodes will all see stable addresses/names.

@chrislovecnm


Member

chrislovecnm commented Oct 8, 2016

@blak3mil3r do you have that documented?

@chrislovecnm


Member

chrislovecnm commented Oct 8, 2016

Sorry, I typo'ed your name, @blak3mill3r, and am on mobile ... GitHub needs an edit button.

@blak3mill3r


blak3mill3r commented Oct 9, 2016

@chrislovecnm Sort of... not very well documented, but I do have notes. I should clarify though: I'm only talking about setting up a secure VPN tunnel connecting two private VPC subnets in different regions... I did not combine that with k8s Petsets. I can't see why it wouldn't work though, if k8s exposes the nodes in a Petset to the VPC, and the IPsec tunnel allows connecting from the remote VPC ...

I'm interested in trying it, but unfortunately I won't be able to dedicate time to it right away. I'd be glad to dig up my notes on the tunneling config and share them (not at the office now).

@blak3mill3r


blak3mill3r commented Oct 9, 2016

After reading more on Petsets, it's not clear to me that it supports exposing each pet as a stable IP/name.

@chrislovecnm


Member

chrislovecnm commented Oct 9, 2016

@blak3mill3r first, can we open an issue? Secondly, static IPs are an enhancement that I have requested.

@surajpasuparthy


surajpasuparthy commented Oct 11, 2016

@chrislovecnm
Would it be possible to have a tunnel between the 2 k8s clusters and enable source NAT on the kube proxy to allow handshaking between the cluster IPs?
The pod IPs would be private in the K8s network, so would a handshake between just the service IPs of C* on both K8s clusters be sufficient to set up a sync across clusters?

@chrislovecnm


Member

chrislovecnm commented Oct 11, 2016

Every C* instance has to talk to every other C* instance.

@chrislovecnm chrislovecnm reopened this Oct 11, 2016

@surajpasuparthy


surajpasuparthy commented Oct 11, 2016

@chrislovecnm
Ah, I feared as much.
So is Ubernetes the ONLY way I can expose the pod IPs on, say, DC1 to DC2? Is there any possible configuration I could use to set up multi-DC sync across Kubernetes clusters without Ubernetes?
At the moment, I have a workaround using "hostNetwork: true" in the Cassandra deployment YAML file (see the fragment below). Doing this assigns my VM's IP as the pod IP for Cassandra. The limitations are that I can have only one Cassandra pod per VM and, of course, that the host network is exposed.
Any suggestions for getting around this issue?
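
For reference, the workaround described above boils down to the pod-template fragment below (a sketch; the rest of the deployment spec is omitted, and the image tag is the one mentioned earlier in the thread):

```yaml
# Deployment pod-template fragment (sketch): run Cassandra on the node's network.
# The pod then advertises the VM's IP, at the cost of one Cassandra pod per VM
# and an exposed host network.
spec:
  template:
    spec:
      hostNetwork: true
      containers:
        - name: cassandra
          image: gcr.io/google-samples/cassandra:v9
```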

@chrislovecnm


Member

chrislovecnm commented Oct 11, 2016

@surajpasuparthy no, this is a routing problem, not a federation/Ubernetes problem. You have to route between nodes, which is a pure networking issue.

@surajpasuparthy


surajpasuparthy commented Oct 11, 2016

@chrislovecnm
I see. Could you give me an idea of how I can set up this networking?
So far, I can only think of a tunnel (TLS) between the 2 VMs (2 single-node K8s clusters), but that will still forward the handshake only to the svc IP, and the pod IPs are private to the cluster.
Thanks in advance

@chrislovecnm


Member

chrislovecnm commented Oct 11, 2016

#27239 is where I would ask you to continue this conversation, and I will close this again.

@chrislovecnm


Member

chrislovecnm commented Oct 17, 2016

Closing this; please refer to #27239 for networking K8s clusters together.
