
[Design] Presto-on-Spark: A Tale of Two Computation Engines #13856

Open
wenleix opened this issue Dec 12, 2019 · 46 comments

@wenleix
Contributor

wenleix commented Dec 12, 2019

Links/Resources:

Abstract

The architectural tradeoff between MapReduce and parallel databases has been an open discussion since the dawn of MapReduce systems over a decade ago. At Facebook, we have spent the past several years scaling Presto to Facebook-scale batch workloads.

Presto Unlimited aims to solve these scalability challenges. After revisiting the key architectural changes (e.g. disaggregated shuffle) required to further scale Presto, we decided on Presto-on-Spark as the path to further scale Presto. See the rest of the design doc for details.

We believe this is only a first step towards closer collaboration between the Spark and Presto communities, and a major step towards enabling a unified SQL experience across interactive and batch use cases.

Introduction

Presto was originally designed for interactive queries but has evolved into a unified engine for both interactive and batch use cases. Scaling an MPP architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1].

Presto Unlimited aims to solve these scalability challenges. To truly scale Presto Unlimited to Internet-scale batch workloads, we need the following (excluding coordinator scaling and spilling):

  1. Scale shuffle. This requires either implementing a MapReduce-style shuffle or integrating with a disaggregated shuffle service such as Cosco.
  2. Scale Presto worker execution. This includes resource isolation, straggler detection, speculative execution, etc.
  3. Scale Presto resource management. Fine-grained resource management is required when a single query can consume years of CPU. Such a concept is known as a mapper/reducer in MapReduce, an executor in Spark, and a lifespan in Presto, similar to what YARN/Mesos provide.

We realized this work lays down the foundation for a general-purpose parallel data processing system, such as Spark, FlumeJava, or Dryad. Note that such a data processing system has its own use cases and well-defined programming abstractions, and takes years to mature.

We concluded that Presto should leverage existing, well-developed systems to scale to large batch workloads, instead of “embedding” such a system inside Presto. We also believe such collaboration would help the whole Big Data community better understand the abstraction boundary between a SQL engine and a data processing system, as well as evolve and refine the execution primitives to provide near-optimal performance without sacrificing the abstractions.

We chose Spark as the parallel data processing system to further scale Presto Unlimited, as it is the most widely used open source system in this category. However, the design and architecture here should apply to any other parallel data processing system as well.

Architecture

[Screenshot: Presto-on-Spark architecture diagram]

  1. The Presto planner needs to know it is generating a plan for Spark execution, so it can prune unnecessary nodes (e.g. LocalExchange).

  2. On each Spark worker, execution involves:
    Constructing the operator factory chain (a.k.a. DriverFactory) through the LocalExecutionPlanner
    Instantiating a driver by binding the input split, and running the driver

  3. Sending the data to a SparkOutputBuffer, which emits it back to Spark (see the sketch below).
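
To make the worker-side steps above more concrete, here is a minimal sketch of how a plan fragment could be evaluated inside a Spark mapPartitions call. The Split, SerializedPage, PrestoDriver, and DriverFactory types below are hypothetical stand-ins, declared only to keep the sketch self-contained; the real implementation wires the actual LocalExecutionPlanner, DriverFactory, and SparkOutputBuffer classes and differs in detail.

import java.io.Serializable;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public final class FragmentExecutionSketch
{
    // Hypothetical stand-ins for the real Presto classes, declared only to keep the
    // sketch self-contained. In the real code the operator factory chain comes from
    // the LocalExecutionPlanner and pages are emitted through a SparkOutputBuffer.
    public interface Split extends Serializable {}
    public interface SerializedPage extends Serializable {}
    public interface PrestoDriver
    {
        void addSplit(Split split);     // bind an input split to the driver
        List<SerializedPage> run();     // run the operator pipeline to completion
    }
    public interface DriverFactory extends Serializable
    {
        PrestoDriver createDriver();    // operator factory chain -> driver instance
    }

    // Each Spark partition corresponds to a set of Presto splits: create a driver,
    // bind the splits, run the pipeline, and hand the produced pages back to Spark.
    public static JavaRDD<SerializedPage> executeFragment(
            JavaSparkContext context,
            DriverFactory driverFactory,
            List<Split> splits,
            int numPartitions)
    {
        FlatMapFunction<Iterator<Split>, SerializedPage> runFragment = partitionSplits -> {
            PrestoDriver driver = driverFactory.createDriver();
            partitionSplits.forEachRemaining(driver::addSplit);
            return driver.run().iterator();   // pages flow back to Spark (shuffle or collect)
        };
        return context.parallelize(splits, numPartitions).mapPartitions(runFragment);
    }

    private FragmentExecutionSketch() {}
}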

@mahengyang

mahengyang commented Dec 13, 2019

Excellent job! A unified entry point for batch data processing and ad-hoc queries is very important for users. Spark, Hive, Flink, MySQL, Elasticsearch, MongoDB, and so on: some are for computation and others are for storing data, but users can connect to them all through Presto!

@wenleix changed the title from [Design] Presto-on-Spark to [Design] Presto-on-Spark: A Tale of Two Computation Engines on Dec 16, 2019
@wenleix pinned this issue Dec 18, 2019
@wenleix added the Roadmap (A top level roadmap item) label Dec 18, 2019
@wenleix
Contributor Author

wenleix commented Dec 18, 2019

TODOs (for tracking purposes, kept updated):

@arhimondr
Member

presto-spark-classloader-interface

As per earlier discussion, we decided to go with this name explicitly to emphasize that this module is only needed for classloader isolation, and not for anything fundamental. Once Spark supports classloader isolation internally (or once it migrates to Java 9+, which supports Java modules), this artificial module should be removed.
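
To illustrate the kind of isolation this module exists for, here is a minimal sketch, not the actual presto-spark-classloader-interface code; the jar path and the entry-point class name are placeholders. The idea is that the Presto runtime and its dependencies are loaded through a separate URLClassLoader, so its dependency versions cannot clash with Spark's, and Spark only ever sees the small set of shared interface types.

import java.net.URL;
import java.net.URLClassLoader;

// A minimal sketch of classloader isolation, not the actual Presto-on-Spark code.
public final class IsolationSketch
{
    public static Object loadIsolatedEntryPoint() throws Exception
    {
        // Placeholder path to the jar(s) containing the Presto runtime and its dependencies.
        URL[] prestoRuntimeJars = {new URL("file:///tmp/presto-spark-package.jar")};

        // A null parent delegates only to the bootstrap classloader, so the Presto side
        // brings its own copies of libraries (e.g. Guava) that could conflict with Spark's.
        // The real module additionally delegates the classloader-interface package to the
        // application classloader so both sides can exchange those shared interface types.
        ClassLoader isolated = new URLClassLoader(prestoRuntimeJars, null);

        // Placeholder entry-point class name; only the shared interface types cross the boundary.
        Class<?> entryPoint = isolated.loadClass("com.example.PrestoTaskExecutorFactory");
        return entryPoint.getConstructor().newInstance();
    }

    private IsolationSketch() {}
}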

@wenleix
Contributor Author

wenleix commented Dec 24, 2019

@arhimondr :

I see. But I do think we might also want to put some common classes into the classloader-interface package; I think TaskProcessors is already there. See for example
#13760 (comment): always using serialized byte arrays makes the code more difficult to understand.

@wubiaoi

wubiaoi commented Dec 31, 2019

What's the difference between doing this and SparkSQL?

@wenleix
Contributor Author

wenleix commented Dec 31, 2019

@wubiaoi : From the user experience perspective, Presto-on-Spark will provide exactly the same language and semantics across interactive and batch. While both Presto and SparkSQL are ANSI SQL compatible, note there is no “ANSI SQL” as a language: ANSI SQL is a (in some ways loose) specification. Many SQL dialects claim to be ANSI SQL compatible (notably Oracle, SQL Server, and DB2), yet they are significantly incompatible with each other.

As explained in more detail in this Quora answer:

ANSI SQL is a specification, not a particular product. It's a document, describing the official features of the SQL language.

Every brand of SQL RDBMS implements a subset of ANSI SQL. Every brand of SQL RDBMS I'm aware of adds some features to the language that are not in the ANSI SQL specification (example: indexes). And each brand implements features in its own way, not necessarily compatible with the others.

Even when the language and semantics are exactly the same, Presto-on-Spark goes further and provides a unified SQL experience for interactive and batch use cases. A unified SQL experience means not only that the SQL language and semantics are the same, but that the overall experience is also similar. This is because, while SQL was originally designed as a declarative language, in almost all practice users depend on engine-specific implementation details and use it in a partly imperative way to get the best performance. The SQL experience includes, but is not limited to:

  • Semantics (e.g. NULL handling)
  • Subtle behaviors (e.g. the maximum array/map size that can be handled, emitting NULL vs. throwing an exception)
  • Language hints
  • UDF experience
  • How the plan will be optimized
  • How the SQL will be executed (e.g. the performance implications of different ways to write the same SQL, such as using UNNEST vs. a lambda)

I will explain the technical perspective in a separate comment :)

@wenleix
Contributor Author

wenleix commented Dec 31, 2019

@wubiaoi : From a technical perspective, SparkSQL's execution model is row-oriented + whole-stage codegen [1], while Presto's execution model is columnar processing + vectorization. So architecture-wise, Presto-on-Spark is more similar to the early research prototype Shark [2].

The design trade-offs between row-oriented + whole-stage codegen and columnar processing + vectorization deserve a very long discussion; I will let @oerling provide more insights :) . However, with modern Big Data, where denormalization is omnipresent, we do see ever-increasing value in columnar processing + vectorization [3].

[1] Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
[2] Shark: SQL and Rich Analytics at Scale: https://cs.stanford.edu/~matei/papers/2013/sigmod_shark.pdf
[3] Everything You Always Wanted To Do in Table Scan: https://prestodb.io/blog/2019/06/29/everything-you-always-wanted-to-do-in-a-table-scan

@wubiaoi

wubiaoi commented Jan 2, 2020

@wenleix 👍 Thank you very much for the explanation.
Is it better to run only the costly stages on Spark?

@wenleix
Contributor Author

wenleix commented Jan 8, 2020

@wubiaoi : While this is certainly possible, it complicates the execution a lot, as it requires coordination between two (heterogeneous) execution engines. Also, why not use Presto Unlimited in this case? :)

@KannarFr

As I'm looking for an HTTP service instead of spark-submit, I can work on it. But now you get what I want to do, right? WDYT about it?

@arhimondr
Member

Classic Presto acts like a service, with an HTTP endpoint to fetch the results. Are you hitting a scalability wall with classic Presto?

As I'm looking for an HTTP service instead of spark-submit, I can work on it. But now you get what I want to do, right? WDYT about it?

I'm not sure if Spark even supports gradual fetching of the results. You can investigate it. But currently we collect the results via the collect call, which returns all the results at once.

As a middle ground, you can change your workload slightly:

  1. Run INSERT INTO tmp_table .... in Presto on Spark, which will write the results into a temporary table
  2. Run SELECT * FROM tmp_table in classic Presto to fetch the results

Generally speaking, Presto on Spark is mostly designed to run insert queries; that's why we don't care much about returning the results.

@KannarFr

Presto on Spark allows changing the catalogs for each query by creating a Presto runner per query, correct? Classic Presto does not support loading/unloading catalogs: #12605.

My main goal is to provide context (a Presto catalog) to classic Presto for each query. But in fact, we need to support a very high scale, and I found this project, which seems to match my requirements.

@arhimondr
Member

Could you please describe your use case a little bit more? Maybe there's a better way to achieve this dynamic catalog behaviour?

@KannarFr

KannarFr commented Sep 24, 2020

Consider millions of catalogs of different types (MySQL, PostgreSQL, ...) and thousands of clients, so a lot of queries.

A client comes with its catalog and query to an HTTP service.

This service sends the catalog list and the query to run to Presto/Presto-on-Spark (let's call it the system).

Then the system should run the query and stream the results back to the client through HTTP chunks, to limit RAM usage if possible.

This is the use case; it seems simple, but its implementation is not.

@wenleix
Contributor Author

wenleix commented Sep 25, 2020

@KannarFr : From an operational/service perspective, Presto-on-Spark is more like Spark. Thus, in my opinion, we should leverage what Spark provides for such a service (instead of thinking about it in the Presto coordinator way).

@djiangc

djiangc commented Nov 22, 2020

@arhimondr @wenleix Is it possible to run multiple SQL queries in the query file?

@arhimondr
Member

@djiangc Unfortunately no. But that should be an easy feature to add.

@djiangc

djiangc commented Nov 26, 2020

@arhimondr @wenleix another question: it seems I can't use cluster deploy-mode with spark-submit for presto-spark-launcher; only client is supported. Is this true, or am I missing something?

@arhimondr
Member

@djiangc Yes, currently only the client mode is supported.

@djiangc

djiangc commented Dec 4, 2020

@arhimondr thanks for your response. I have another question: can I do an insert overwrite?
set session hive.insert_existing_partitions_behavior='OVERWRITE'; insert into test3 select *, 2 from test

@arhimondr
Member

@djiangc Currently the launcher doesn't support setting session properties. You must enable the OVERWRITE behaviour with a configuration property: https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java#L561

Also, it should be pretty easy to add a parameter to the launcher that accepts session properties.
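
For reference, a hedged sketch of what this could look like in the Hive catalog properties file that the launcher is pointed at. The property name below is my reading of HiveClientConfig and the file path is just an example, so please verify both against your Presto version:

# etc/catalog/hive.properties (example path)
hive.insert-existing-partitions-behavior=OVERWRITE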

@djiangc

djiangc commented Dec 4, 2020

Many thanks for your help and the pointer; I got the partition overwrite working with presto-spark-launcher. @arhimondr

@JituS

JituS commented Feb 4, 2021

@arhimondr I am not able to run an insert in overwrite mode by setting the above property. Is it not supported with S3?
@djiangc Are you using S3 or HDFS?
I'm getting the below exception:
java.lang.IllegalStateException: Overwriting existing partition doesn't support DIRECT_TO_TARGET_EXISTING_DIRECTORY write mode

@rguillome

Hi,

@arhimondr I understand the philosophy behind this sentence

Generally speaking Presto on Spark is mostly designed to run insert queries

but the insert needs a predefined destination table with a schema, format, and location, right?

As an AWS user, what I would find very useful is to write the result of a Presto-on-Spark SELECT to an S3 location and run a Glue crawler on that location to have the table and the inferred schema automatically created.

Maybe a CLI argument configuring the dataOutputLocation would do the trick?

@arhimondr
Member

@rguillome Hi! Thanks for reaching out.

In our case we know the output schema in advance, thus we always end up running INSERT INTO ... on an existing table. If the schema is unknown in your use case, did you consider running CREATE TABLE AS SELECT ... to create a temporary table with a well-defined schema?

@rguillome

rguillome commented Apr 15, 2021

Hi @arhimondr

I was trying to CREATE TABLE AS SELECT ... with an external location but encountered this line in the presto-hive HiveMetadata.beginCreateTable method:

if (getExternalLocation(tableMetadata.getProperties()) != null) {
    throw new PrestoException(NOT_SUPPORTED, "External tables cannot be created using CREATE TABLE AS");
}

So basically I will try to push an MR with the corresponding changes already made in trinodb.

I wonder if the ultimate solution shouldn't be an option to write each final split directly to an HDFS or S3 location, to avoid the gathering at the driver level. We could imagine having all the benefits of Hadoop FS organisation (partitioning, bucketing, sorting, and splits). But I'm not yet comfortable with all the details it would involve to dig into this for now.

@GithubZhitao

Is DynamicFilter (vs. DynamicPartitionPrune) applied to Presto-on-Spark's physical plan?

@huleilei

@wubiaoi : From a technical perspective, SparkSQL's execution model is row-oriented + whole-stage codegen [1], while Presto's execution model is columnar processing + vectorization. So architecture-wise, Presto-on-Spark is more similar to the early research prototype Shark [2].

The design trade-offs between row-oriented + whole-stage codegen and columnar processing + vectorization deserve a very long discussion; I will let @oerling provide more insights :) . However, with modern Big Data, where denormalization is omnipresent, we do see ever-increasing value in columnar processing + vectorization [3].

[1] Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html [2] Shark: SQL and Rich Analytics at Scale: https://cs.stanford.edu/~matei/papers/2013/sigmod_shark.pdf [3] Everything You Always Wanted To Do in Table Scan: https://prestodb.io/blog/2019/06/29/everything-you-always-wanted-to-do-in-a-table-scan

SparkSQL 3.0+'s execution model is also columnar processing + vectorization.

@GithubZhitao

Is DynamicFilter (vs. DynamicPartitionPrune) applied to Presto-on-Spark's physical plan?

It is!

@whutpencil

@wenleix Hello, I have a question. Although compatibility is increased, for queries over small amounts of data, isn't the query speed slowed down by adding a materialized shuffle? At the same time, I would like to ask how the improved Presto and SparkSQL compare on large amounts of data.

@rongrong
Contributor

@wenleix Hello, I have a question. Although compatibility is increased, for queries over small amounts of data, isn't the query speed slowed down by adding a materialized shuffle? At the same time, I would like to ask how the improved Presto and SparkSQL compare on large amounts of data.

The idea is to run small queries on classic Presto, and run large queries (that won't fit within the memory limit) and long-running queries (which are more likely to be affected by cluster stability issues) using Presto-on-Spark.

@GithubZhitao

@rongrong
One more question: how does Presto-on-Spark deal with the large amount of data transfer when executing large queries?
As far as I know, data is transported by a broadcast mechanism. Will all this moved data go through the Spark driver, which is a single point coordinating all global data streams? Any bottleneck?

@whutpencil

@rongrong Does this mean that if the user fails to execute through Presto and finds that the SQL is a large query, they then submit it through Presto on Spark? Does the user have a process for switching the submission method?
At first, I mistakenly thought that all Presto queries were submitted through Presto on Spark.

@GithubZhitao

GithubZhitao commented Mar 4, 2022

@rongrong Does this mean that if the user fails to execute through Presto and finds that the SQL is a large query, they then submit it through Presto on Spark? Does the user have a process for switching the submission method?
At first, I mistakenly thought that all Presto queries were submitted through Presto on Spark.

As far as I know, these are two totally separate processes; you must develop the logic to decide whether it is a large query yourself.
Presto-on-Spark is essentially a Spark process if you ignore the Presto code logic. Neither has anything to do with the other. @whutpencil

@476474988

For the same SQL, does Presto-on-Spark use less memory but take more time?

@tdcmeehan unpinned this issue Apr 19, 2022