Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize MergingPageOutput #15052

Open
yingsu00 opened this issue Aug 19, 2020 · 3 comments
Open

Optimize MergingPageOutput #15052

yingsu00 opened this issue Aug 19, 2020 · 3 comments

Comments

@yingsu00
Copy link
Contributor

In Facebook workloads, we observed MergingPageOutput.getOutput() was allocating large amount of memory, in which BlockBuilders keep increasing the arrays after appending some rows. It also creates lots of Slices just to read the data from the source. This can be avoided because the positions in the pages being merged is already known and we don't need to grow memory or create insane amount of Slices on the fly. Instead of appending the values one by one, we can make the BlockBuilders be able to append an existing block internally. This should also benefit the performance.

@yingsu00
Copy link
Contributor Author

cc @mbasmanova @bhhari @oerling

@mbasmanova
Copy link
Contributor

CC: @arhimondr

@yingsu00
Copy link
Contributor Author

Just discussed with @oerling about this. Orri's point is the MergingPageOutput cost from the small pages might well over the benefit of merging them to larger pages. Moreover, if the MergingPageOutput is followed by a repartition(which might be most of the cases), the effect of small pages will be gone anyways. Therefore, a more straightforward and time saving way is to remove MergingPageOutput whenever it's followed by repartition. I will do performance testing afterwards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants