Cluster Metadata - Long Term Plan #4

Open
martinsumner opened this issue Feb 6, 2024 · 1 comment
Comments

@martinsumner

martinsumner commented Feb 6, 2024

There exist two primary ways of agreeing on metadata within a Riak cluster.

  • The ring;
    • For non-typed buckets, their properties are stored on the ring.
    • Otherwise the ring is reserved for information about the allocation of partitions within the cluster (and current known status of cluster members).
    • The ring is gossiped (using riak_core_gossip) and persisted on changes, and conflicting changes are merged.
    • Although changes to the cluster membership are made through the claimant, changes to bucket properties bypass the claimant and are made directly.
    • There may be issues with changing bucket properties while cluster membership changes are being planned.
    • When the ring is updated, the bucket properties are flushed and re-written to an ETS table from where they are read.
    • After 90s without a ring change, read access to the latest ring is optimised using riak_core_mochiglobal (essentially compiling the ring into a module - see the sketch after this list) - but requests for bucket properties only ever use the ETS table.
    • Each node sends its ring to another random node every 60s, even if there are no changes, so every node should eventually find out about a new version of the ring, even if the node was down when a ring change was gossiped.
  • Cluster metadata;
    • For typed buckets and bucket types, their properties are stored in cluster metadata.
    • Configuration for riak_core_security is also stored in cluster metadata.
    • There are no other uses for cluster metadata.
    • Cluster metadata is entirely separate from the ring. It is gossiped using riak_core_broadcast, and stored within a DETS file, with a read-only version promoted to an ETS table.
    • Reading bucket properties for typed and non-typed buckets follows two different code paths (one reads from the ring, the other from cluster metadata) - but both code paths ultimately lead to an ets:lookup/2 (although in the case of a typed bucket two lookups are required).
    • If cluster metadata gets out of sync between nodes, the differences are detected by active anti-entropy (using the riak_core_metadata_hashtree module, which then uses hashtree and eleveldb).
    • For managing eventual consistency in cluster metadata the dvvset module is used. This is the only use of dvvset within riak_core.
    • Although typed bucket property changes do not use the ring, they are channeled via the claimant node.
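
As a rough illustration of the riak_core_mochiglobal trick referenced above - the module and function names here are illustrative, not the actual riak_core_mochiglobal API - the promoted term is compiled into a generated module as a constant, so later reads are plain function calls that avoid the per-read copy an ets:lookup/2 incurs:

```erlang
-module(static_ring_cache).
-export([promote_ring/1, get_ring/0]).

%% Compile RingTerm into a generated module cached_ring:value/0.
%% Reads then avoid copying the (potentially large) ring out of ETS.
promote_ring(RingTerm) ->
    Forms = [erl_syntax:revert(F) || F <- [
        erl_syntax:attribute(erl_syntax:atom(module),
                             [erl_syntax:atom(cached_ring)]),
        erl_syntax:attribute(erl_syntax:atom(export),
                             [erl_syntax:list(
                                [erl_syntax:arity_qualifier(
                                   erl_syntax:atom(value),
                                   erl_syntax:integer(0))])]),
        erl_syntax:function(erl_syntax:atom(value),
                            [erl_syntax:clause([], none,
                                               [erl_syntax:abstract(RingTerm)])])]],
    {ok, cached_ring, Bin} = compile:forms(Forms, []),
    _ = code:purge(cached_ring),    %% drop any previously generated version
    {module, cached_ring} = code:load_binary(cached_ring, "cached_ring.erl", Bin),
    ok.

get_ring() ->
    cached_ring:value().
```

Recompiling a module on every gossiped change would be expensive, which is why promotion only happens after the ring has been stable for 90s.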

Some notes:

  • Both the use of an ETS table as a read-only cache and the use of riak_core_mochiglobal seem anachronistic in the context of Erlang's persistent_term.
  • The most obvious change is to replace the use of riak_core_mochiglobal with persistent_term for storing the ring (and the ring epoch) - see the first sketch after this list. There is already the delayed protection of promote_ring to prevent over-frequent updates.
  • If riak_core_security is not used, and buckets are either generally un-typed or limited in number, the riak_core_metadata approach seems excessive.
  • riak_core_metadata was added for a reason, though - is riak_core_metadata/riak_core_broadcast a more robust and scalable approach than riak_core_ring/riak_core_gossip?
  • The need for AAE in riak_core_metadata creates a dependency on eleveldb. eleveldb itself can be removed, but that doesn't eliminate the complexity of the process. Also, this PR changes the underlying store for hashtree - and so has an impact on those using AAE for KV anti-entropy.
  • Riak core metadata is subject to a hard limit: the 2GB maximum size of DETS table files.
  • The need for AAE is to ensure eventual consistency (if a node was down and missed broadcasts, it must discover the discrepancy via AAE), but given a 2GB maximum size, perhaps simpler methods than hashtree can be used for comparisons, e.g. iterating over the keys and comparing directly (see the second sketch after this list).
  • There is the additional complexity of fixups (which is currently only used by riak_repl); fixups allow an application to register a need to filter all bucket property changes before they are applied (in the case of riak_repl this is used to ensure that the riak_repl post-commit hook is applied on every bucket).
  • The net effect of the different methods for holding cluster metadata is confusing.
  • There are no known issues with the current setup; it may be confusing, but it works.
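
A minimal sketch of the persistent_term suggestion above, assuming a hypothetical ring_pt_cache module rather than any existing riak_core code; writes stay on the delayed promotion path (updating an existing persistent_term key forces a scan of all processes), while reads become constant-time and copy-free:

```erlang
-module(ring_pt_cache).
-export([promote_ring/1, get_ring/0]).

-define(RING_KEY, {?MODULE, ring}).

%% Expensive: updating an existing persistent_term key forces a scan of
%% all processes (and copying in any still holding the old term), so this
%% should only run on the delayed promote_ring path, not on every
%% gossiped ring change.
promote_ring(Ring) ->
    persistent_term:put(?RING_KEY, Ring).

%% Cheap: no copying and no ETS lookup, just a global term reference.
get_ring() ->
    persistent_term:get(?RING_KEY, undefined).
```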

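A sketch of the "iterate and compare" alternative to hashtree-based AAE - hypothetical names, with erlang:phash2/1 standing in for whatever object hash is preferred; while the store is bounded by the 2GB DETS limit, each node can fold its metadata into a sorted {Key, Hash} digest and two digests can be diffed directly:

```erlang
-module(md_full_exchange).
-export([local_digest/1, diff/2]).

%% Fold the local store (here simply a list of {Key, Object} pairs, e.g.
%% from ets:tab2list/1 on the read-only copy) into a sorted digest.
local_digest(KVPairs) ->
    lists:sort([{Key, erlang:phash2(Obj)} || {Key, Obj} <- KVPairs]).

%% Compare two digests; returns keys only present on one side, plus keys
%% present on both but with differing hashes - the candidates for repair.
diff(DigestA, DigestB) ->
    MapA = maps:from_list(DigestA),
    MapB = maps:from_list(DigestB),
    OnlyA   = [K || {K, _} <- DigestA, not maps:is_key(K, MapB)],
    OnlyB   = [K || {K, _} <- DigestB, not maps:is_key(K, MapA)],
    Changed = [K || {K, H} <- DigestA,
                    maps:is_key(K, MapB), maps:get(K, MapB) =/= H],
    {OnlyA, OnlyB, Changed}.
```
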
Should there be a long-term plan to change this? To unify or simplify the methods of storing and gossiping information required across the cluster? Is there sufficient potential reward to justify the risk of any change (particularly with respect to the overhead of testing backwards compatibility)?

@martinsumner
Author

See also #5
