Skip to content
This repository was archived by the owner on Feb 4, 2021. It is now read-only.

Bug 1390320 - Allow Dataset api consumers to control the amount of parallelism used#6

Merged
maurodoglio merged 1 commit into
mozilla:masterfrom
thomcc:equally-sized-groups
Aug 29, 2017
Merged

Bug 1390320 - Allow Dataset api consumers to control the amount of parallelism used#6
maurodoglio merged 1 commit into
mozilla:masterfrom
thomcc:equally-sized-groups

Conversation

@thomcc
Copy link
Copy Markdown

@thomcc thomcc commented Aug 23, 2017

As it is it often forces a potentially very expensive shuffle in order to increase the parallelism. This was causing performance issues in a zeppelin notebook I was using for bug 1381641, where I'd get around 3 partitions and want to expand them to a much larger size.

It would probably be simpler without the threshold checking, but given it's current value, it seems like it might be intended to prevent something further down the line from hitting some signed integer overflow later, I wasn't sure.

I've made it opt-in, although it seems likely to be a better default than the current (if you ask me).

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #6 into master will decrease coverage by 1.37%.
The diff coverage is 84.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master       #6      +/-   ##
==========================================
- Coverage   99.21%   97.84%   -1.38%     
==========================================
  Files           5        5              
  Lines         127      139      +12     
  Branches       20       23       +3     
==========================================
+ Hits          126      136      +10     
- Misses          1        3       +2
Impacted Files Coverage Δ
...ain/scala/com/mozilla/telemetry/heka/Dataset.scala 97.36% <66.66%> (-2.64%) ⬇️
...la/com/mozilla/telemetry/utils/ObjectSummary.scala 94.11% <90%> (-5.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96e42c6...f15dd47. Read the comment docs.

@mreid-moz mreid-moz requested a review from maurodoglio August 25, 2017 14:17
@maurodoglio
Copy link
Copy Markdown

Hey @thomcc thanks for the awesome patch.
I did a similar work for the python implementation of Dataset here. In general I like your solution as it solves the problem of having fewer partitions than executors so let's ship it. In the future we may want to change it to give priority to the number of files in a group rather than the total file size, but let's keep it for another day.

@maurodoglio
Copy link
Copy Markdown

Almost forgot, let's keep the current behaviour the default for now. Once we have verified on a real cluster that your code performs better we can remove the old code and remove the switch.

Copy link
Copy Markdown

@maurodoglio maurodoglio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 it

@maurodoglio maurodoglio merged commit 215abfc into mozilla:master Aug 29, 2017
@mreid-moz
Copy link
Copy Markdown

let's keep the current behaviour the default for now. Once we have verified
on a real cluster that your code performs better we can remove the old code
and remove the switch.

@maurodoglio Is there an issue/bug for this? If not, please file a follow-up so it remains on our radar.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants