Provide a better default for num_tokens #324

Closed
jsanda opened this issue Feb 5, 2021 · 13 comments · Fixed by #382
jsanda commented Feb 5, 2021

Is your feature request related to a problem? Please describe.
The default behavior of cass-operator is to create the Cassandra cluster with vnodes enabled and a single token per node. (As an aside, cass-operator does not support manual token assignments.)

The out-of-the-box default for Cassandra is num_tokens: 256. Experience over time has shown that this is not a good default. In fact, a couple of Netflix engineers published a white paper titled Cassandra Availability with Virtual Nodes that discusses how 256 tokens actually decreases availability.

Given the problems with 256 tokens, it makes sense for cass-operator to choose a different value. The nodetool output from a couple of test clusters demonstrates the problem with a single token:

# 3 node cluster

$ kubectl exec -it test-dc1-default-sts-0 -c cassandra -- nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.40.0.8   70.94 KiB  1            31.8%             42f9ff07-927e-4e81-8de7-59232d28de89  default
UN  10.40.1.67  71.07 KiB  1            88.9%             f4729051-86be-42f1-9d0b-a09d04761795  default
UN  10.40.2.20  85.29 KiB  1            79.3%             37c7b5bb-a573-439e-bcc8-8b1c9fd08217  default
# 9 node cluster

$ kubectl exec -it test-dc1-default-sts-0 -c cassandra -- nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.40.0.10  90.23 KiB  1            77.5%             5af74306-ec08-4176-aaad-6dae7ad9f950  default
UN  10.40.2.26  90.23 KiB  1            39.9%             a3a22bcf-6e61-4135-a017-3c797fd876f7  default
UN  10.40.5.3   90.36 KiB  1            79.4%             78d82669-2cf9-4a7e-841d-4bc9ffaebbd0  default
UN  10.40.6.3   30.4 KiB   1            83.7%             718f3945-0bad-4c81-bd0c-06680daeec62  default
UN  10.40.7.3   71.07 KiB  1            78.3%             324c3fc5-fe3d-4c3d-929a-9409c8370eee  default
UN  10.40.8.3   90.23 KiB  1            36.6%             c373a0fa-f208-4b80-bc03-fcbb41a89ab9  default
UN  10.40.4.5   71.07 KiB  1            73.0%             edc40ba7-d48d-4170-b60a-68995cf4f2ca  default
UN  10.40.1.21  90.36 KiB  1            31.5%             407f920f-89ad-4e69-8274-6cacaf6baf6a  default

Token range ownership is way out of balance, which can cause major issues as the load on the cluster increases.

Describe the solution you'd like
We need a more sensible default for num_tokens for the 1.0 release.

@adejanovski you have done a good bit of analysis in this area. What are your thoughts? Keep in mind that we are looking for a better default for 3.11.x in particular at the moment, and that the even token distribution algorithm is not an option (at least not in the 1.0 time frame).

jsanda added this to To do in K8ssandra via automation Feb 5, 2021
jsanda added this to the 1.0.0 milestone Feb 5, 2021
adejanovski commented

@jsanda, if we cannot use the new token allocation algorithm, then we need to stick with 256 vnodes. We have learned to live with it, and the biggest problem with it was repair. Since k8ssandra ships with Reaper, and Reaper groups token ranges with the same replicas into segments that 3.11 will process in a single repair session, we're mostly fine.

There's a little trick we could use, though, in order to go as low as 16 tokens and still use the new algorithm:

- Seed nodes should start without allocate_tokens_for_keyspace being set.
- Non-seed nodes should start with allocate_tokens_for_keyspace: system_distributed.

The system_distributed keyspace will be created automatically with SimpleStrategy and RF=3 when the seed nodes start. Non-seed nodes will then bootstrap in the same way 4.0 does, using allocate_tokens_for_replication_factor: 3.
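
To make that concrete, a minimal sketch of the two cassandra.yaml variants (targeting 16 tokens; only these keys would differ between seed and non-seed nodes):

    # cassandra.yaml for seed nodes: random token allocation bootstraps the ring
    num_tokens: 16
    # allocate_tokens_for_keyspace deliberately left unset

    # cassandra.yaml for non-seed nodes: targeted allocation fills out the ring
    num_tokens: 16
    allocate_tokens_for_keyspace: system_distributed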

Is this something we could pull off with cass-operator?

JeremiahDJordan commented Feb 12, 2021

Can we not use these properties with the Management API to get the keyspace set up for use with allocate_tokens_for_keyspace?

      - -Dcassandra.system_distributed_replication_dc_names=dc1
      - -Dcassandra.system_distributed_replication_per_dc=3
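
In cass-operator terms these would presumably be passed as extra JVM flags; a quick sketch of where they might live (the jvm-options / additional-jvm-opts layout is my assumption about the 3.11 config, so verify it):

    config:
      jvm-options:
        additional-jvm-opts:
          # sets system_distributed replication to RF=3 for dc1 at startup
          - "-Dcassandra.system_distributed_replication_dc_names=dc1"
          - "-Dcassandra.system_distributed_replication_per_dc=3"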

jsanda commented Feb 12, 2021

@JeremiahDJordan That's a great question. My understanding of what needs to happen is based on https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html. Assuming we need to implement some or all of those steps, I think it would have to happen after 1.0.

Presumably, cass-operator would need to calculate the tokens. AFAIK it does not do this currently.

I am not sure whether we would need changes in cass-operator so that it starts the seed node per rack rather than bringing up a rack at a time. I assume the latter, since it is the StatefulSet controller that manages pod creation.

Lastly, I am not sure whether cass-operator and config-builder support specifying different tokens for individual nodes.

JeremiahDJordan commented

As long as a keyspace with a good RF exists, you can use allocate_tokens_for_keyspace. So I think using the Management API properties that default the distributed keyspace replication to a good RF may do what is needed. Then you just point allocate_tokens_for_keyspace at one of the keyspaces that property affects, and use 8 or 16 tokens, or whatever.

JeremiahDJordan commented

> Presumably, cass-operator would need to calculate the tokens. AFAIK it does not do this currently.

Not sure what you are referencing here. You shouldn't need to calculate any tokens, just set allocate_tokens_for_keyspace and num_tokens.

jsanda commented Feb 12, 2021

Assuming we cannot use allocate_tokens_for_keyspace (or at the very least it is out of scope for 1.0), we will use num_tokens: 256 for 3.11 and not make it configurable.

If we are running 4.0, then we should use the 4.0 default of 16 tokens since the smart token allocation is enabled by default.
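
For reference, a sketch of the version-dependent defaults that implies (the 4.0 lines mirror what I understand the stock 4.0 cassandra.yaml ships with; worth double-checking the exact key name):

    # Cassandra 3.11: no safe allocation target yet, keep the old default
    num_tokens: 256

    # Cassandra 4.0: stock defaults; token allocation is RF-aware out of the box
    num_tokens: 16
    allocate_tokens_for_local_replication_factor: 3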

jsanda commented Feb 12, 2021

> Not sure what you are referencing here. You shouldn't need to calculate any tokens, just set allocate_tokens_for_keyspace and num_tokens.

I am basing my limited understanding on https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html. It outlines the steps that need to be performed, and the first step it mentions is calculating and setting tokens for the seed nodes in each rack.

JeremiahDJordan commented

Ah, you don't need to do that. The seed nodes should just use random token calculation, and then the non-seed nodes use the algorithm to "fill out" the ring. The seed nodes purposefully use random distribution; the algorithm was designed with a random starting point in mind.

jsanda commented Feb 12, 2021

@JeremiahDJordan can you provide a short list of the steps? Maybe it is less work than I originally thought. If it is doable, then we could use the same default as 4.0, which would greatly simplify upgrades.

JeremiahDJordan commented Feb 12, 2021

@jsanda it should just be

  1. Make sure there is a keyspace with your RF existing in the schema, which I think the following settings should give you. (You would need to always set them; it looks like right now k8ssandra only sets these if auth is enabled.)
      - -Dcassandra.system_distributed_replication_dc_names=dc1
      - -Dcassandra.system_distributed_replication_per_dc=3
  2. Set allocate_tokens_for_keyspace and num_tokens (see the sketch below).
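
Putting the two steps together, a rough sketch of the resulting config block (the exact spec.config field names here are my assumption of the cass-operator layout for 3.11, not verified):

    spec:
      config:
        jvm-options:
          additional-jvm-opts:
            # step 1: give system_distributed a real RF in this DC at startup
            - "-Dcassandra.system_distributed_replication_dc_names=dc1"
            - "-Dcassandra.system_distributed_replication_per_dc=3"
        cassandra-yaml:
          # step 2: a low token count plus an allocation target keyspace
          num_tokens: 16
          allocate_tokens_for_keyspace: system_distributed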

jsanda commented Feb 12, 2021

I just confirmed that cass-operator starts seed nodes first, so I think using allocate_tokens_for_keyspace is totally doable.

I suppose we can default to RF=3 and expose a property in values.yaml for the RF. We should include detailed documentation with the property explaining that it is used for allocate_tokens_for_keyspace and that the RF should be set to whatever the application is going to use.
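
Something along these lines, perhaps (the property name is purely hypothetical, just to illustrate the shape):

    # values.yaml (sketch; property name hypothetical)
    cassandra:
      # RF for the keyspace that allocate_tokens_for_keyspace will point at;
      # set it to whatever RF the application keyspaces will use
      tokenAllocationRF: 3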

jsanda commented Feb 15, 2021

It looks like we will need to add support for allocate_tokens_for_keyspace in cass-config-definitions. I see it in the DSE templates but not in the C* ones.

jsanda commented Feb 17, 2021

@burmanm since we do not yet have support for allocate_tokens_for_keyspace, we will go with a default of 256 for num_tokens.
