High Availability #391

Open
dain opened this Issue Mar 6, 2019 · 2 comments

@dain
Member

dain commented Mar 6, 2019

Goals

At the highest level, our goal is to create a highly available Presto setup. More specifically, we would like to accomplish the following:

  • Reduce impact of crashed coordinator
  • Simplify deployment architecture (e.g., fast failover, load balancers, etc.)
  • Reduce failures due to upgrades

Work Items

  • Dispatcher split
    • Move queue management into separate internal service (#95)
    • Secure submission query API
    • Split queued and executing client API
  • Dispatcher-only or coordinator-only server
    • Only start the necessary services, depending on server role
  • Multiple dispatchers
    • Proxy for dumb load balancer
    • Durable resource groups
  • Multiple coordinators in a cluster
    • Add configuration to run shared discovery across coordinators
    • Add coordination for OOM killer and reserved memory pool
  • Multiple clusters
    • Cluster/coordinator discovery (each cluster has independent discovery)
    • Cluster selection system for queries
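The "cluster selection system" work item above could start as a simple policy in the dispatcher tier that routes each query to the cluster with the most spare capacity. A minimal sketch, with all type and method names hypothetical (none of these are Presto APIs):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical view of one cluster as seen by the dispatcher tier.
record ClusterStatus(String name, int activeWorkers, int runningQueries, int maxQueries) {
    // Remaining query slots on this cluster.
    int freeSlots() {
        return maxQueries - runningQueries;
    }
}

class ClusterSelector {
    // Pick the live cluster with the most free query slots;
    // empty if no cluster can currently take work.
    static Optional<ClusterStatus> select(List<ClusterStatus> clusters) {
        return clusters.stream()
                .filter(c -> c.activeWorkers() > 0 && c.freeSlots() > 0)
                .max(Comparator.comparingInt(ClusterStatus::freeSlots));
    }
}
```

A real implementation would also weigh queue depth, resource group limits, and cluster version, but the shape of the decision is the same.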

Deployment Architectures

There are multiple ways to combine the above work items into different setups to achieve different goals. The following sections describe the most common setups:

Multiple coordinators with Shared Queue

In this setup there is a single dispatcher containing the shared queue, and multiple coordinators in the cluster. If the dispatcher crashes, all queued queries fail, but executing queries will continue. If a coordinator crashes, all queries managed by that coordinator fail, but other queued queries and queries managed by other coordinators will continue.

This setup requires the following:

  • Dispatcher split
  • Dispatcher-only or coordinator-only server
  • Multiple coordinators in a cluster
  • (Optional) Multiple dispatchers

Highly Available Coordinators

In this setup there are multiple coordinators and each contains a dispatcher. Queued queries are durable, but executing queries can fail.

This setup requires the following:

  • Dispatcher split
  • Multiple dispatchers
  • Multiple coordinators in a cluster

Multiple Clusters with Shared Queue

In this setup there is a dispatcher tier that is managing a shared queue for multiple clusters. In the simplest form, there is a single dispatcher for all clusters and a single coordinator for each cluster. As above, if either fails, it only fails the queries being managed by that instance. This project has multiple followup projects to make the dispatcher HA, and to allow multiple coordinators in a cluster.

This setup requires the following:

  • Dispatcher split
  • Dispatcher-only or coordinator-only server
  • Multiple clusters
  • (Optional) Multiple dispatchers
  • (Optional) Multiple coordinators
@oneonestar

Contributor

oneonestar commented Mar 6, 2019

We hope this patch will also:

  • Simplify the collaboration with resource orchestration platforms (K8S / Mesos)
  • Allow zero-downtime rolling upgrades
  • Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)

Path 1: Multiple coordinators with Shared Queue

This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer taking Path 3 first and then moving toward the outcome of Path 1.

Path 2: Highly Available Coordinators

I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators.
Also, this doesn't allow a dispatcher-only mode, and we must always have two or more coordinators online.

Path 3: Multiple Clusters with Shared Queue

The multiple clusters model makes zero-downtime rolling upgrades easy. The most important advantage of this path, I think, is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects, including the load balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.

I'll vote for Path 3 for its extensibility and its step-by-step, sprint-friendly development approach.

@dain

Member Author

dain commented Mar 6, 2019

@oneonestar:

  • Simplify the collaboration with resource orchestration platforms (K8S / Mesos)

This isn't one of my goals, but if this helps great.

  • Allow zero-downtime rolling upgrades

All three of the designs allow for this. For multi-cluster it is straightforward: shut down one cluster, upgrade it, and bring it back online. For multi-coordinator it is similar. The two key things to know are that Presto coordinators and workers must run the same version, and that there is a config option for a minimum number of workers. So you can upgrade a single coordinator; it will become visible to the dispatchers, but it won't accept queries until it has enough workers it can use. Then you simply upgrade the workers one at a time. When the number of upgraded workers crosses the threshold, the coordinator starts accepting work. The full solution is a bit more complex.
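The minimum-worker gate described here maps to coordinator configuration along these lines (property names as found in recent Presto releases; the values are illustrative only):

```properties
# Hold queries until at least this many workers have announced themselves.
query-manager.required-workers=50
# How long a query may wait for the worker threshold before failing.
query-manager.required-workers-max-wait=5m
```

During a rolling upgrade, the freshly upgraded coordinator sits below the threshold, so it queues rather than runs queries until enough same-version workers are available.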

  • Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)

Only solutions with "Multiple coordinators" would do this.

Path 1: Multiple coordinators with Shared Queue
This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer taking Path 3 first and then moving toward the outcome of Path 1.

From the first version of Presto, the system has always done "cooperative scheduling of workers between coordinators", so this isn't too difficult. Actually, there was a period at FB where we made all workers coordinators by accident, and everything just worked.

The complex part is really around decisions that must be globally made. Specifically, which query to promote to the reserved pool, or which query to kill in low memory, must be coordinated to prevent multiple coordinators making conflicting decisions at the same time.
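One standard way to keep those global decisions consistent is to let only an elected leader among the coordinators choose a victim query, so two coordinators never react to the same low-memory event with conflicting kills. A sketch under that assumption; the types and the leader-election supplier are hypothetical, not Presto code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical cluster-wide summary of a query's memory reservation.
record QueryMemory(String queryId, String coordinatorId, long reservedBytes) {}

class LowMemoryKiller {
    private final String localCoordinatorId;
    private final Supplier<String> leaderSupplier; // e.g. backed by leader election

    LowMemoryKiller(String localCoordinatorId, Supplier<String> leaderSupplier) {
        this.localCoordinatorId = localCoordinatorId;
        this.leaderSupplier = leaderSupplier;
    }

    // Only the current leader picks a victim; every other coordinator stands down,
    // so at most one kill decision is made per out-of-memory event.
    Optional<String> chooseVictim(List<QueryMemory> clusterWideQueries) {
        if (!localCoordinatorId.equals(leaderSupplier.get())) {
            return Optional.empty();
        }
        return clusterWideQueries.stream()
                .max(Comparator.comparingLong(QueryMemory::reservedBytes))
                .map(QueryMemory::queryId);
    }
}
```

The same single-decider pattern would apply to promoting a query into the reserved memory pool.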

Path 2: Highly Available Coordinators
I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators.
Also, this doesn't allow a dispatcher-only mode, and we must always have two or more coordinators online.

In this design, the client contacts a dispatcher service which happens to be running in the same process as the coordinators. This is the "Multiple dispatchers" project mentioned above, and it requires "Proxy for dumb load balancer". Anything that has HA for the front end will require some kind of dumb load balancer, but even DNS should work.

In this design, you could scale down all the way to a single coordinator (containing a dispatcher). The dispatcher service would accept and queue queries, but would not start them until enough workers showed up. If the dispatcher failed, another one would need to be started before the clients timed out, but assuming it was, it would rebuild the queues and the clients would continue.
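Rebuilding the queue after such a failover only requires the accepted-but-not-yet-dispatched entries to be durable. A minimal sketch, assuming a hypothetical persistence layer that survives the crash (again, not Presto code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical durable record of a query that was accepted but not yet dispatched.
record QueuedQuery(String queryId, long enqueueTimeMillis, String sql) {}

class DispatchQueue {
    // Oldest queries first, restoring the original submission order.
    private final PriorityQueue<QueuedQuery> queue =
            new PriorityQueue<>(Comparator.comparingLong(QueuedQuery::enqueueTimeMillis));

    // On startup, a replacement dispatcher reloads every surviving entry,
    // so clients that are still polling see their queries resume.
    void rebuildFrom(List<QueuedQuery> persisted) {
        queue.clear();
        queue.addAll(persisted);
    }

    // Next query to hand to a coordinator, or null when the queue is empty.
    QueuedQuery pollNext() {
        return queue.poll();
    }
}
```

The ordering key would really be the resource group's queueing policy, but enqueue time shows the idea.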

Path 3: Multiple Clusters with Shared Queue
The multiple clusters model makes zero-downtime rolling upgrades easy. The most important advantage of this path, I think, is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects, including the load balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.

I think those points are true of all of the above designs.


My main concern with approach 3 is that it might only be good for really large users that have multiple clusters and are OK with an additional tier of dispatcher machines. Additionally, I don't think it actually makes Presto more highly available, which is what people have been asking for. Anyway, I am happy to work on whatever the community wants done first.

@dain dain added the roadmap label Mar 13, 2019
