Support table write commit in Presto on Spark #13854
Conversation
Force-pushed from fb8b1c3 to 5057d25.
"Refactor HiveWriterFactory" LGTM
"Rename PageSinkProperties#isPartitionCommitRequired" Changing to commit
is a little bit ambiguous because it might entangle with commit
mechanism for transactions (TransactionalMetadata
). Maybe isWriteCommitRequired
? Honestly I don't have a good name. Also, the unsupported error message in connectors might also need to be changed to reflect the new API name
"Rename partition / lifespan commit into table write commit" LGTM
"Introduce ConnectorCommitStrategy" LGTM % nits
Resolved comment threads (outdated):
- presto-main/src/main/java/com/facebook/presto/operator/TableCommitContext.java
- presto-main/src/main/java/com/facebook/presto/operator/TableWriterOperator.java
- presto-main/src/main/java/com/facebook/presto/operator/TableWriterOperator.java
"Support table write commit in Presto on Spark" Looks good with question
Comment threads:
- presto-main/src/main/java/com/facebook/presto/operator/TableWriterOperator.java (outdated, resolved)
- presto-main/src/main/java/com/facebook/presto/sql/planner/LocalExecutionPlanner.java (resolved)
Force-pushed from 5057d25 to f5f3e6c.
That's a great point. When I thought about this issue back in 2019/12, I think even the old name … However, it's difficult to give it an appropriate name under the connector abstraction, as this "rename operation" sits at a lower level than the connector abstraction. Now, relooking at it after 4 months, … Another way is to model it as a two-stage commit, and this is the "commit" at the …
In that case, …
Resolved comment threads (outdated):
- presto-spi/src/main/java/com/facebook/presto/spi/connector/ConnectorCapabilities.java
- presto-main/src/main/java/com/facebook/presto/operator/ConnectorCommitStrategy.java
@shixuan-fan @wenleix I like calling this concept …
Since we decided to go with …
Updated.
Force-pushed from f5f3e6c to dcba312.
LGTM
if (pageSinkCommitRequired) {
    return TASK_COMMIT;
}
if (stageExecutionDescriptor.isRecoverableGroupedExecution()) {
Should we first check whether isRecoverableGroupedExecution is true and return LIFESPAN_COMMIT first? I feel like isRecoverableGroupedExecution is a stronger predicate than pageSinkCommitRequired. What do you think?
I agree. Let me do that.
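The reordered check agreed on above might look like the following sketch. The enum values and predicate names are taken from the snippet in this thread; the surrounding class and method are invented for illustration and are not the actual Presto code.

```java
// Hypothetical sketch of the reordered commit-strategy selection:
// check the stronger predicate, recoverable grouped execution, first.
enum CommitStrategy { NO_COMMIT, TASK_COMMIT, LIFESPAN_COMMIT }

class CommitStrategyResolver {
    static CommitStrategy resolve(boolean recoverableGroupedExecution, boolean pageSinkCommitRequired) {
        // Recoverable grouped execution implies per-lifespan commits,
        // regardless of whether the page sink itself requires a commit.
        if (recoverableGroupedExecution) {
            return CommitStrategy.LIFESPAN_COMMIT;
        }
        if (pageSinkCommitRequired) {
            return CommitStrategy.TASK_COMMIT;
        }
        return CommitStrategy.NO_COMMIT;
    }
}
```

With this ordering, a recoverable grouped execution always selects LIFESPAN_COMMIT even when the page sink also requires a commit.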
The problem with this commit protocol is that the renaming of all the files happens on the coordinator (or the driver in Spark). In Presto it is not an issue, as the coordinator receives partition updates continuously, as soon as each TableWriter finishes writing. In Spark, however, all partition updates are delivered at once, only after the writing has completely finished. This creates an unnecessary "hiccup" at the very end of the query, as the coordinator has to rename thousands of files in a loop. It also puts additional stress on the file system, as a very high number of files has to be renamed in a very short period of time.

I just had an interesting discussion with @sameeragarwal, and it turns out Spark's commit protocol does not require files to be renamed on the driver. In Spark the output file names are deterministic. As long as the target file name is the same across task attempts, tasks are allowed to speculatively "rename-overwrite" destination files without the risk of introducing duplicated data.

It feels like ideally we would want a commit protocol similar to the one Spark has. I wonder if we still want to keep the current approach as a temporary transitional solution? Thoughts?
From what I understand, Spark supports dynamic partitions. What happens if the partition key is non-deterministic? For example, the first run adds a file to partition 'p1' and commits. The second run doesn't add any files to 'p1', but instead adds some files to 'p2'?
@arhimondr It'll break -- we require partition keys to be deterministic (not just for the commit protocol, but for general task retries as well)
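A toy model of the rename-overwrite behavior discussed above (all names invented; this is neither Presto nor Spark code) shows both why deterministic names are safe and why a non-deterministic partition key breaks the protocol:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Spark-style "rename-overwrite" commits: the destination file
// name is a pure function of (partition, taskId), independent of the attempt,
// so a retried attempt overwrites the previous one instead of duplicating it.
class RenameOverwriteCommit {
    final Map<String, String> files = new HashMap<>();

    void commitAttempt(String partition, int taskId, String data) {
        // Deterministic destination name: same key for every attempt.
        files.put(partition + "/part-" + taskId, data);
    }
}
```

If the partition key is deterministic, a retry of the same task simply replaces its earlier output. If it is not (the first attempt writes to 'p1', the retry writes to 'p2'), the stale 'p1' file from the failed attempt is never overwritten and survives, corrupting the output -- which is why partition keys must be deterministic.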
@arhimondr Just curious, is it possible to amortize these renames by committing at the time of receiving each page from the table writer, rather than committing all files after receiving everything from the table writer?
@shixuan-fan That's what we do in conventional Presto. On Spark, however, the results from the upstream stage are delivered all at once, when the upstream stage is finished =\
Minor variable renames
The page sink commit mechanism is a general connector capability and is not restricted to partition commits.
It can be used not only to commit lifespans or physical partitions; in fact, it can be used to commit any page sink write.
Co-authored-by: Andrii Rosa <andriirosa@fb.com>
Tasks in Spark are often retried and run speculatively, thus a commit protocol is required for table writes to avoid data corruption. Co-authored-by: Andrii Rosa <andriirosa@fb.com>
Force-pushed from dcba312 to 8452b97.
Also, we did this in Presto to reduce query latency and coordinator pressure; in Spark: …
This is required by Presto-on-Spark (#13856) in case there are job failures/retries. Data written by failed tasks shouldn't be visible.
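The visibility requirement above can be sketched as a staged-then-commit model. All names here are invented for illustration; the actual mechanism in the PR is the connector's page sink commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of task-level commit for retried/speculative tasks:
// files written by a task stay staged, and become visible only when the
// task commits; a failed or duplicate speculative attempt is aborted.
class StagedTaskWriter {
    final Map<Integer, List<String>> staged = new HashMap<>();
    final List<String> visible = new ArrayList<>();

    void write(int taskId, String file) {
        staged.computeIfAbsent(taskId, k -> new ArrayList<>()).add(file);
    }

    void commit(int taskId) {
        // Exactly one attempt per task is allowed to commit its output.
        visible.addAll(staged.remove(taskId));
    }

    void abort(int taskId) {
        // A failed attempt's output never becomes visible.
        staged.remove(taskId);
    }
}
```

Until commit is called, nothing a task wrote is observable, so a job-level failure or retry can discard staged output without readers ever seeing partial data.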