Scale to larger clusters #10174

highker opened this issue Mar 16, 2018 · 9 comments


@highker highker commented Mar 16, 2018

Presto clusters currently only support a single coordinator. The number of worker nodes that a cluster can support is, therefore, limited by the resources (CPU and memory) available at the coordinator. When the number of nodes required to serve queries in a given region exceeds the maximum size of a single cluster, currently there are only two options available:

1. Improve coordinator efficiency to increase the worker:coordinator ratio
There is likely more room for improvement in this ratio, but we're certainly at the point of diminishing returns on engineering work. In addition, it is likely much more impactful to focus on worker efficiency improvements (given that there are three orders of magnitude more workers than coordinators).

2. Run multiple clusters and load-balance between them
Unless the load balancer has an understanding of resource management configuration and resource group state of the clusters, it can only do approximate routing.

Neither of these is a particularly appealing option.
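
To illustrate why option 2 only gives approximate routing, here is a minimal sketch comparing a state-blind round-robin balancer with a router that can see how busy each cluster is. The cluster names, the `running`/`limit` fields, and both router functions are hypothetical and are not part of Presto; real resource group state is considerably richer than a single counter.

```python
from itertools import cycle

# Hypothetical snapshot: running queries vs. a concurrency limit per cluster.
clusters = {
    "cluster-a": {"running": 10, "limit": 10},  # saturated
    "cluster-b": {"running": 2, "limit": 10},   # mostly idle
}


def round_robin_router(names):
    """State-blind routing: hands out cluster names in a fixed rotation."""
    rotation = cycle(names)
    return lambda: next(rotation)


def state_aware_router(state):
    """Picks the cluster with the most free slots, or None if all are full."""
    def route():
        name = max(state, key=lambda n: state[n]["limit"] - state[n]["running"])
        free = state[name]["limit"] - state[name]["running"]
        return name if free > 0 else None
    return route


blind = round_robin_router(list(clusters))
aware = state_aware_router(clusters)

# The snapshot is not updated between calls, so each router sees the same state.
blind_choices = [blind() for _ in range(4)]
aware_choices = [aware() for _ in range(4)]
```

In this sketch the round-robin router keeps sending every other query to the saturated cluster, while the state-aware router always picks the one with free slots.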

In the long term, we likely want to support clusters with many thousands of nodes with a pool of coordinators orchestrating query execution.

In the short term, we want to create a query dispatcher that is responsible for admission control and which can route queries to a coordinator when resources are available.

The rest of this document describes this short term plan (currently called 'query dispatcher').

Broad Steps

  1. Create a new dispatcher node type that is responsible for admission control and supports routing queries to a single coordinator
  2. Add support for routing to one of multiple coordinators in the cluster, supporting a static partitioning of workers into Worker Pools, each of which is managed by a coordinator
  3. Support dynamic partitioning of workers into Worker Pools
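
As a rough illustration of step 2, static partitioning could be as simple as a fixed assignment of workers to coordinators. This is a sketch under assumptions: the function name `partition_workers`, the node names, and the round-robin assignment rule are all hypothetical, and the issue does not specify how pools would actually be defined.

```python
def partition_workers(workers, coordinators):
    """Statically assign workers to coordinators, one Worker Pool each.

    Workers are sorted first so the assignment is deterministic, then dealt
    out round-robin. This is an illustrative rule, not Presto's.
    """
    if not coordinators:
        raise ValueError("at least one coordinator is required")
    pools = {c: [] for c in coordinators}
    for i, worker in enumerate(sorted(workers)):
        pools[coordinators[i % len(coordinators)]].append(worker)
    return pools


# Hypothetical node names.
workers = [f"worker-{i:02d}" for i in range(7)]
pools = partition_workers(workers, ["coord-1", "coord-2"])
```

Dynamic partitioning (step 3) would replace this fixed mapping with one that is recomputed as load or membership changes.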

Design Constraints

  1. When running a single coordinator, administrators should have the ability to colocate dispatcher and coordinator (to minimize deployment changes). This may also be important for latency-sensitive applications.
  2. Compatibility with the current client protocol. The community has a variety of clients (written in different languages) that talk to a Presto cluster through the HTTP /v1 endpoints. We should not make breaking changes to this protocol.

Tentative Feature Design

  1. UI - Summary info for running queries will be stored on dispatcher, and query detail info will be stored on coordinators
  2. Routing - Dispatcher will use HTTP redirects to route to coordinators, and will hand-off execution completely at that point
  3. Resource Management - Coordinators will have no admission control capabilities; the dispatcher will perform admission control based on query summary statistics
  4. Transactions - Cookie-based transaction control - always route queries in the same transaction to the same coordinator
  5. Cluster Memory Management - When a single cluster can have multiple coordinators, there should be a service to control the cluster memory usage. It may live on the dispatcher, or there may be one per cluster
  6. Discovery - Similar to cluster memory management, discovery can live on the dispatcher, or there can be one per cluster
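
The routing, resource management, and transaction items above can be sketched together as a toy dispatcher. Everything here is an assumption made for illustration: the `Coordinator` and `Dispatcher` classes, the per-coordinator `max_running` limit, the choice of HTTP 307 as the redirect status, and the shape of the redirect URL are not taken from Presto. The transaction id is passed as a plain argument, standing in for the value that would arrive in a cookie.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    url: str
    max_running: int  # hypothetical admission limit tracked by the dispatcher
    running: int = 0

    def has_capacity(self):
        return self.running < self.max_running


@dataclass
class Dispatcher:
    coordinators: list
    queue: deque = field(default_factory=deque)
    txn_owner: dict = field(default_factory=dict)  # transaction id -> Coordinator

    def submit(self, query_id, txn_id=None):
        """Return (307, location) to hand off the query, or ("QUEUED", None)."""
        target = self._pick(txn_id)
        if target is None:
            # Admission control: hold the query until capacity frees up.
            self.queue.append((query_id, txn_id))
            return ("QUEUED", None)
        target.running += 1
        if txn_id is not None:
            self.txn_owner[txn_id] = target
        # Illustrative URL shape only; after the redirect the dispatcher
        # takes no further part in executing the query.
        return (307, f"{target.url}/v1/statement/{query_id}")

    def _pick(self, txn_id):
        # Queries in a known transaction must stay on the owning coordinator.
        if txn_id is not None and txn_id in self.txn_owner:
            owner = self.txn_owner[txn_id]
            return owner if owner.has_capacity() else None
        free = [c for c in self.coordinators if c.has_capacity()]
        return min(free, key=lambda c: c.running, default=None)


dispatcher = Dispatcher([
    Coordinator("http://coord-1:8080", max_running=1),
    Coordinator("http://coord-2:8080", max_running=1),
])
first = dispatcher.submit("q1", txn_id="t1")   # new transaction, least-loaded pick
second = dispatcher.submit("q2", txn_id="t1")  # same transaction, owner is full
third = dispatcher.submit("q3")                # no transaction, any free coordinator
```

The sketch leaves out draining the queue when a query finishes, along with failure detection, both of which fall under the known problems listed below.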

Known Problems

  • Discovery + coordinator failure detection
  • Query state/results client
  • Materialized view scheduling (future)
  • Multi-stage programs (future)
    • Parking queries and resuming
  • Query failure recovery
  • (likely many more)
@raghavsethi raghavsethi changed the title from "Support multiple coordinators in a cluster" to "Scale to larger clusters" Mar 18, 2018
@highker highker removed their assignment May 16, 2018
@famosss famosss commented Aug 15, 2018

Which version will support the dispatcher node?

@dain dain commented Aug 15, 2018

@raghavsethi and I are working on this feature now

@HariSekhon HariSekhon commented Nov 26, 2018

Regarding scalable query execution, peer-based systems such as Elasticsearch, Impala, and Apache Drill, which can accept queries on any node, allow simple round-robin load balancing in front of the nodes to scale the queries without them being a single coordinator bottleneck.

@dain dain commented Nov 26, 2018

@HariSekhon, that may be possible, but having dedicated coordinator instances will be less risky. Workers can become less responsive when computationally heavy or network-heavy workloads are running, and this can affect query performance and reliability. For very large clusters, having a few coordinators (maybe 4-8) should be more than sufficient to handle most workloads.

BTW, the concern in the comment "nodes to scale the queries without them being a single coordinator bottleneck" isn't a big problem for most people. FB runs 800-node clusters over Hive data with a single coordinator. Places that need more compute run multiple clusters, which you are going to want for maintenance purposes anyway.

@xqliu xqliu commented Nov 27, 2018

@HariSekhon HariSekhon commented Nov 27, 2018

@dain good points. Elastic allows you to separate your coordinators for this reason, but at least it allows multiple of them in a peer-like system for HA.

I think if Presto allowed control of the advertised coordinator address that worker nodes reply to, you could set it to the DNS address of the VIP, and HA coordinators would be a straightforward active/passive load balancer setup.

Since AWS has taken Presto for Athena, I wonder if they have coordinator HA given they are always on... I suspect they might just load balance multiple single-coordinator clusters... anyone know?

@dain dain commented Nov 27, 2018

@HariSekhon, sorry, I wasn't trying to make a statement about HA. This work is a precursor to having multiple coordinators, which will help reduce the impact of any one of them failing. Also, when it comes to HA, we will need to make the dispatcher (specifically the queues) HA, which will help in the event that a dispatcher crashes. One thing to note is that the reason this work hasn't seen much attention over the years is that it isn't a problem most people run into (coordinators don't generally crash).

@HariSekhon HariSekhon commented Nov 28, 2018

I agree coordinator processes or servers don't crash often, but it's generally considered good practice to have high availability for the times when they do. You can't predict server hardware faults or administrative errors such as taking down the wrong server, and most other technologies have considered it important enough to eliminate single points of failure by designing for HA.

I appreciate you guys are already on it and that it's an item on the backlog. I understand that these things take time and are limited by available resources. I'm not using Presto right now since I changed jobs, but I hope to use it again in the future, by which time I'm hoping there is HA :)

Thanks for the great work on Presto so far by the way, it was a pleasure to use :)

@weldpua2008 weldpua2008 commented Feb 10, 2019

Hi guys,
Has anyone started implementing this?
I would like to help implement the feature.

8 participants