This repository was archived by the owner on Jan 22, 2026. It is now read-only.

Bug 1349608 - Improve Dataset handling of small pings #127

Merged
maurodoglio merged 1 commit into master from group-by-size-min on Mar 28, 2017
@maurodoglio
Contributor

@maurodoglio maurodoglio commented Mar 27, 2017

The idea here is to use a greedy algorithm to balance the RDD partitions,
first by number of files and then by the total size of each partition.
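
A minimal sketch of what such a greedy grouping could look like, assuming each file is described by a dict with a `size` field in bytes. The function name `group_by_size_greedy` and the input shape are hypothetical and not taken from this PR's diff. Files are sorted largest first and dealt out round-robin, so file counts per partition differ by at most one and byte totals stay roughly even.

```python
def group_by_size_greedy(files, num_groups):
    """Deal files into num_groups buckets, largest first, round-robin.

    Hypothetical helper for illustration only; `files` is assumed to be a
    list of dicts with a 'size' key (bytes).
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    # Largest files first, so each round hands out similarly sized files.
    ordered = sorted(files, key=lambda f: f["size"], reverse=True)
    groups = [[] for _ in range(num_groups)]
    for index, f in enumerate(ordered):
        # Round-robin keeps the number of files per group within one.
        groups[index % num_groups].append(f)
    return groups


files = [{"key": str(i), "size": s} for i, s in enumerate([5, 1, 9, 3, 7, 2, 8])]
groups = group_by_size_greedy(files, 3)
```

Each resulting group could then back one RDD partition, which is how the file count and byte total per partition would both end up bounded.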

@maurodoglio maurodoglio force-pushed the group-by-size-min branch 2 times, most recently from dbaf940 to daf5526 on March 27, 2017 16:38
@coveralls

Coverage Status

Coverage decreased (-0.1%) to 68.486% when pulling daf5526 on group-by-size-min into d52da12 on master.

@maurodoglio
Contributor Author

maurodoglio commented Mar 27, 2017

I ran a few comparison tests on 2 different queries:

  • 10% of every ping with docType = OTHER
    • Time before change: 16min 40s
    • Time after change: 10min 43s (-35.7%)
  • 10% of main pings for a given submissionDate
    • Time before change: 35min 20s
    • Time after change: 22min 47s (-35.5%)

As a bonus, the logic in the grouping function is now simpler (and more readable).
I also tried a slightly more sophisticated implementation before this one, which always selected the smallest partition as the destination for a file. That didn't perform as well as this one: its execution time was halfway between the original implementation and the current one.
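
For comparison, the "always pick the smallest partition" variant mentioned above can be sketched with a min-heap. This is only an illustration under assumptions: the name `group_by_smallest_partition` and the dict-with-`size` input are hypothetical, sorting the files largest first is my own choice, and the code that was actually benchmarked is not shown in this thread.

```python
import heapq


def group_by_smallest_partition(files, num_groups):
    """Assign each file to whichever group currently holds the fewest bytes.

    Hypothetical helper for illustration only; `files` is assumed to be a
    list of dicts with a 'size' key (bytes).
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    # Assumption: process the largest files first.
    ordered = sorted(files, key=lambda f: f["size"], reverse=True)
    groups = [[] for _ in range(num_groups)]
    # Heap entries are (total bytes so far, group index).
    heap = [(0, i) for i in range(num_groups)]
    heapq.heapify(heap)
    for f in ordered:
        total, i = heapq.heappop(heap)
        groups[i].append(f)
        heapq.heappush(heap, (total + f["size"], i))
    return groups


files = [{"key": str(i), "size": s} for i, s in enumerate([5, 1, 9, 3, 7, 2, 8])]
groups = group_by_smallest_partition(files, 3)
totals = [sum(f["size"] for f in g) for g in groups]
```

This variant balances bytes only and places no bound on the number of files per partition, which may be relevant when many of the pings are small.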

@maurodoglio maurodoglio self-assigned this Mar 27, 2017
@maurodoglio maurodoglio requested a review from mreid-moz March 27, 2017 17:24
@coveralls

Coverage Status

Coverage decreased (-0.1%) to 68.486% when pulling dc2ce0a on group-by-size-min into d52da12 on master.

Contributor

@mreid-moz mreid-moz left a comment

LGTM

@maurodoglio
Contributor Author

I forgot to mention: to make the comparison fairer, I bypassed the sampling provided by .records and instead used the same set of pre-sampled summaries for both runs. I can provide the sampled data as JSON if anybody is interested.
