Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail][performance] Fix literals lazy decoding #6605

Merged
merged 8 commits into from Jul 11, 2019

Conversation

@tpoterba
Copy link
Collaborator

commented Jul 10, 2019

Cloud providers don't want you to know this one weird trick to save
thousands on your compute bill

This PR:

Running matrix_table_array_arithmetic...
    run 1: 18.28
    run 2: 15.26
    run 3: 14.76

Master:

Running matrix_table_array_arithmetic...
    (I waited 30 minutes and the first partition still hadn't finished...)

Lower bound: 900x faster

[hail][performance] Fix literals lazy decoding
Cloud providers don't want you to know this one weird trick to save
thousands on your compute bill

tpoterba added some commits Jul 11, 2019

fix

@tpoterba tpoterba added the prio:high label Jul 11, 2019

tpoterba added some commits Jul 11, 2019

@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 11, 2019

This change fixes a huge problem caused by these lines of code and context:

https://github.com/hail-is/hail/pull/6605/files#diff-1278c1788239002cc63ccb82cbef8d76L190

The problem is that in our generated code, every literal is decoded each time any literal is referenced. This is extremely expensive!

In this change, we instead decode the literals once with the function is constructed from the partition index (used with randomness), by adding a new region argument which the literals are decoded into. This region must live as long as the RegionValues returned by any invocation of the function.

The primary error mode I might expect is that we use the wrong region to generate the function, causing use-after-free errors. These are well-covered by tests, since I had a few of these bugs and fixed them due to test failures. The region we shouldn't be using is ctx.region, which refers to RVDContext.region, the per-row region that is cleared after each record.

ctx.r (the global execution context region) and ctx.freshRegion (a partition-owned global region, generally named globalRegion or partRegion) are safe.

@danking danking merged commit 02f4e10 into hail-is:master Jul 11, 2019

1 check passed

ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.