Support table write commit in Presto on Spark #13854
Conversation
Force-pushed from fb8b1c3 to 5057d25.
"Refactor HiveWriterFactory" LGTM
"Rename PageSinkProperties#isPartitionCommitRequired" Changing to commit
is a little bit ambiguous because it might entangle with commit
mechanism for transactions (TransactionalMetadata
). Maybe isWriteCommitRequired
? Honestly I don't have a good name. Also, the unsupported error message in connectors might also need to be changed to reflect the new API name
"Rename partition / lifespan commit into table write commit" LGTM
"Introduce ConnectorCommitStrategy" LGTM % nits
Resolved comment threads (outdated):
- presto-main/src/main/java/com/facebook/presto/operator/TableCommitContext.java
- presto-main/src/main/java/com/facebook/presto/operator/TableWriterOperator.java
- presto-main/src/main/java/com/facebook/presto/operator/TableWriterOperator.java
"Support table write commit in Presto on Spark" Looks good with question
Comment threads:
- presto-main/src/main/java/com/facebook/presto/operator/TableWriterOperator.java (outdated, resolved)
- presto-main/src/main/java/com/facebook/presto/sql/planner/LocalExecutionPlanner.java (resolved)
Force-pushed from 5057d25 to f5f3e6c.
That's a great point. When I thought about this issue back in 2019/12, I think even the old name … However, it's difficult to give it an appropriate name under the connector abstraction, as this "rename operation" sits at a lower level than the connector abstraction. Now, relooking at it after 4 months, … Another way is to model it as a two-stage commit, and this is the "commit" at the …
In that case, …
Resolved comment threads (outdated):
- presto-spi/src/main/java/com/facebook/presto/spi/connector/ConnectorCapabilities.java
- presto-main/src/main/java/com/facebook/presto/operator/ConnectorCommitStrategy.java
@shixuan-fan @wenleix I like calling this concept …
Since we decided to go with …
Updated.
Force-pushed from f5f3e6c to dcba312.
LGTM
if (pageSinkCommitRequired) {
    return TASK_COMMIT;
}
if (stageExecutionDescriptor.isRecoverableGroupedExecution()) {
Should we first check whether isRecoverableGroupedExecution is true and return LIFESPAN_COMMIT first? I feel like isRecoverableGroupedExecution is a stronger predicate than pageSinkCommitRequired. What do you think?
I agree. Let me do that.
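The reordered check agreed on above might look like the following sketch. The enum values and predicate names are taken from the snippet in this thread; the surrounding class and method are invented for illustration and are not the actual Presto code.

```java
// Hypothetical sketch of the reordered commit-strategy selection:
// check the stronger predicate, recoverable grouped execution, first.
enum CommitStrategy { NO_COMMIT, TASK_COMMIT, LIFESPAN_COMMIT }

class CommitStrategyResolver {
    static CommitStrategy resolve(boolean recoverableGroupedExecution, boolean pageSinkCommitRequired) {
        // Recoverable grouped execution implies per-lifespan commits,
        // regardless of whether the page sink itself requires a commit.
        if (recoverableGroupedExecution) {
            return CommitStrategy.LIFESPAN_COMMIT;
        }
        if (pageSinkCommitRequired) {
            return CommitStrategy.TASK_COMMIT;
        }
        return CommitStrategy.NO_COMMIT;
    }
}
```

With this ordering, a recoverable grouped execution always selects LIFESPAN_COMMIT even when the page sink also requires a commit.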
The problem with this commit protocol is that the renaming of all the files happens on the coordinator (or the driver in Spark). In Presto it is not an issue, as the coordinator receives partition updates continuously, as soon as each TableWriter finishes writing. In Spark, however, all partition updates are delivered at once, only after the writing has completely finished. This creates an unnecessary "hiccup" at the very end of the query, as the coordinator has to rename thousands of files in a loop. It also puts additional stress on the file system, as a very high number of files has to be renamed in a very short period of time.

I just had an interesting discussion with @sameeragarwal, and it turns out Spark's commit protocol does not require files to be renamed on the driver. In Spark the output file names are deterministic. As long as the target file name is the same across task attempts, tasks are allowed to speculatively "rename-overwrite" destination files without the risk of introducing duplicated data.

It feels like ideally we would want a commit protocol similar to the one Spark has. I wonder if we still want to keep the current approach as a temporary transitional solution? Thoughts?
From what I understand, Spark supports dynamic partitions. What happens if the partition key is non-deterministic? For example, the first run adds a file to partition 'p1' and commits. The second run doesn't add any files to 'p1', but instead adds some files to 'p2'?
@arhimondr It'll break -- we require partition keys to be deterministic (not just for the commit protocol, but for general task retries as well)
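A toy model of the rename-overwrite behavior discussed above (all names invented; this is neither Presto nor Spark code) shows both why deterministic names are safe and why a non-deterministic partition key breaks the protocol:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Spark-style "rename-overwrite" commits: the destination file
// name is a pure function of (partition, taskId), independent of the attempt,
// so a retried attempt overwrites the previous one instead of duplicating it.
class RenameOverwriteCommit {
    final Map<String, String> files = new HashMap<>();

    void commitAttempt(String partition, int taskId, String data) {
        // Deterministic destination name: same key for every attempt.
        files.put(partition + "/part-" + taskId, data);
    }
}
```

If the partition key is deterministic, a retry of the same task simply replaces its earlier output. If it is not (the first attempt writes to 'p1', the retry writes to 'p2'), the stale 'p1' file from the failed attempt is never overwritten and survives, corrupting the output -- which is why partition keys must be deterministic.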
@arhimondr Just curious, is it possible to amortize these renames by committing at the time of receiving each page from the table writer, rather than committing all files after receiving everything from the table writer?
@shixuan-fan That's what we do in conventional Presto. On Spark, however, the results from the upstream stage are delivered all at once, when the upstream stage is finished =\
Minor variable renames
The page sink commit mechanism is a general connector capability and is not restricted to partition commits.
It can be used not only to commit lifespans or physical partitions; in fact, it can be used to commit any page sink write.
Co-authored-by: Andrii Rosa <andriirosa@fb.com>
Tasks in Spark are often retried and run speculatively, thus a commit protocol is required for table writes to avoid data corruption. Co-authored-by: Andrii Rosa <andriirosa@fb.com>
Force-pushed from dcba312 to 8452b97.
Also, we did this in Presto to reduce query latency and coordinator pressure; in Spark: …
This is required by Presto-on-Spark (#13856) in case there are job failures/retries. Data written by failed tasks shouldn't be visible.
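The visibility requirement above can be sketched as a staged-then-commit model. All names here are invented for illustration; the actual mechanism in the PR is the connector's page sink commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of task-level commit for retried/speculative tasks:
// files written by a task stay staged, and become visible only when the
// task commits; a failed or duplicate speculative attempt is aborted.
class StagedTaskWriter {
    final Map<Integer, List<String>> staged = new HashMap<>();
    final List<String> visible = new ArrayList<>();

    void write(int taskId, String file) {
        staged.computeIfAbsent(taskId, k -> new ArrayList<>()).add(file);
    }

    void commit(int taskId) {
        // Exactly one attempt per task is allowed to commit its output.
        visible.addAll(staged.remove(taskId));
    }

    void abort(int taskId) {
        // A failed attempt's output never becomes visible.
        staged.remove(taskId);
    }
}
```

Until commit is called, nothing a task wrote is observable, so a job-level failure or retry can discard staged output without readers ever seeing partial data.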