You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository was archived by the owner on Feb 4, 2021. It is now read-only.
As it is it often forces a potentially very expensive shuffle in order to increase the parallelism. This was causing performance issues in a zeppelin notebook I was using for bug 1381641, where I'd get around 3 partitions and want to expand them to a much larger size.
It would probably be simpler without the threshold checking, but given it's current value, it seems like it might be intended to prevent something further down the line from hitting some signed integer overflow later, I wasn't sure.
I've made it opt-in, although it seems likely to be a better default than the current (if you ask me).
Hey @thomcc thanks for the awesome patch.
I did a similar work for the python implementation of Dataset here. In general I like your solution as it solves the problem of having fewer partitions than executors so let's ship it. In the future we may want to change it to give priority to the number of files in a group rather than the total file size, but let's keep it for another day.
Almost forgot, let's keep the current behaviour the default for now. Once we have verified on a real cluster that your code performs better we can remove the old code and remove the switch.
let's keep the current behaviour the default for now. Once we have verified
on a real cluster that your code performs better we can remove the old code
and remove the switch.
@maurodoglio Is there an issue/bug for this? If not, please file a follow-up so it remains on our radar.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Labels
None yet
4 participants
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
As it is it often forces a potentially very expensive shuffle in order to increase the parallelism. This was causing performance issues in a zeppelin notebook I was using for bug 1381641, where I'd get around 3 partitions and want to expand them to a much larger size.
It would probably be simpler without the threshold checking, but given it's current value, it seems like it might be intended to prevent something further down the line from hitting some signed integer overflow later, I wasn't sure.
I've made it opt-in, although it seems likely to be a better default than the current (if you ask me).