Timely worker synchronization for new dataflow creation #6
Comments
Btw, it is probably worth spec'ing out which parts of Materialize exist in which offerings. My sense is:
That's one hypothetical partitioning of value, but with something like that in mind we can more clearly determine which features need to go where. At the moment
Currently, @benesch is working on an intermediate API, 'metastore'. The intention is for metastore to be backed by ZooKeeper for (2). But it's equally possible to have a shim 'zookeeper' that's a linked list, satisfying (1) with no additional Apache cruft. ZooKeeper is such an attractive option because, if the primary streaming data ingest layer is Kafka, then Kafka users typically have ZooKeeper running anyhow.
Closing this out. I'm pretty happy with the end result, which uses ZooKeeper to store the metadata; the Chosen Node (worker node 0) then pushes the metadata updates through a timely sequencer for a consistent ordering.
For posterity: getting ZooKeeper to expose a consistent stream of events is not possible; it's simply not built for that. It might be possible with etcd, but even if it is, it's definitely awkward. If we wanted that, we'd need to bundle a Raft implementation, which seems like serious overkill. Using ZooKeeper for persistence and timely's sequencer for ordering actually seems like the best option.
All timely workers need to create new dataflows in the exact same order. This is arguably an implementation flaw: timely workers could in principle have a namespace, and new dataflows could register against that namespace. Today, however, channels for dataflows are assigned from a single global counter, so every timely worker must create new dataflows in the same order.
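To illustrate why the ordering matters, here is a hypothetical, std-only Rust sketch (not timely's actual allocation code): if channel ids are handed out from a counter that advances on every dataflow creation, two workers that create the same dataflows in different orders end up assigning different ids to the same dataflow, and cross-worker channels would be wired up incorrectly.

```rust
use std::collections::HashMap;

// Each worker allocates channel ids for its dataflows from a counter that
// mirrors the global allocation scheme: one id per creation, in creation order.
fn allocate_ids(order: &[&str]) -> HashMap<String, usize> {
    let mut counter = 0;
    let mut ids = HashMap::new();
    for name in order {
        ids.insert(name.to_string(), counter);
        counter += 1;
    }
    ids
}

fn main() {
    let worker0 = allocate_ids(&["view_a", "view_b"]);
    let worker1 = allocate_ids(&["view_b", "view_a"]); // different order!
    // The same dataflow now has different channel ids on the two workers.
    assert_ne!(worker0["view_a"], worker1["view_a"]);
    println!(
        "worker0 view_a id = {}, worker1 view_a id = {}",
        worker0["view_a"], worker1["view_a"]
    );
}
```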
To facilitate this ordering, the `interactive` crate has a meta-dataflow into which workers insert their intents to create new dataflows. This meta-dataflow assigns an ordering, which workers use to coordinate. One disadvantage of this hack is that dataflows are not persistent: if we want to recover from crashes, we need to be able to see the history of dataflows we created (i.e., the views). Replacing the meta-dataflow framework with a consensus-replicated durable log would fulfill this second goal, as well as satisfying the original desideratum of a consistent shared ordering.
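The intent-sequencing idea can be sketched with plain std primitives (hypothetical names; this is not the actual `interactive` crate API): workers race to submit creation intents, and a single sequencer assigns each intent a sequence number that every worker then replays in order.

```rust
use std::sync::mpsc;
use std::thread;

// The sequencer's only job: impose one total order on concurrent intents.
fn sequence(intents: impl Iterator<Item = String>) -> Vec<(usize, String)> {
    intents.enumerate().collect()
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // Two workers submit their dataflow-creation intents concurrently.
    let mut handles = Vec::new();
    for (worker, view) in [(0, "view_a"), (1, "view_b")] {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            tx.send(format!("worker{worker} creates {view}")).unwrap();
        }));
    }
    drop(tx);
    for h in handles {
        h.join().unwrap();
    }

    // Whatever order the intents arrived in, the sequence numbers give every
    // worker the same ordering to replay.
    for (seq, intent) in sequence(rx.into_iter()) {
        println!("seq {seq}: {intent}");
    }
}
```

Note that the arrival order of racing intents is nondeterministic; the point is only that all workers observe the *same* (arbitrary) order once it is sequenced.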
We should rip out the meta-dataflow and instead route all dataflow creation commands through a shared durable log (e.g., built on top of ZooKeeper or etcd).
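A minimal sketch of the replay idea, assuming a hypothetical `Command` type and an in-memory `Vec` standing in for the ZooKeeper/etcd-backed log: every creation command is appended to one log, and a worker (including one recovering from a crash) rebuilds its dataflow set by replaying the log from the beginning.

```rust
// Hypothetical command type; the real log would carry full view definitions.
#[derive(Clone, Debug, PartialEq)]
enum Command {
    CreateDataflow(String),
    DropDataflow(String),
}

// Replaying the log deterministically reconstructs the set of live dataflows.
fn replay(log: &[Command]) -> Vec<String> {
    let mut live = Vec::new();
    for cmd in log {
        match cmd {
            Command::CreateDataflow(name) => live.push(name.clone()),
            Command::DropDataflow(name) => live.retain(|n| n != name),
        }
    }
    live
}

fn main() {
    let log = vec![
        Command::CreateDataflow("view_a".into()),
        Command::CreateDataflow("view_b".into()),
        Command::DropDataflow("view_a".into()),
    ];
    // Because the log defines one total order of commands, a recovering
    // worker lands in the same state as every other worker.
    assert_eq!(replay(&log), vec!["view_b".to_string()]);
    println!("live dataflows after replay: {:?}", replay(&log));
}
```

This gives both properties at once: durability (the log survives crashes) and a consistent shared ordering (the log's append order is the creation order for every worker).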