GloWGR scaling improvements #282
Conversation
Codecov Report
@@ Coverage Diff @@
## master #282 +/- ##
==========================================
- Coverage 93.76% 93.64% -0.13%
==========================================
Files 94 92 -2
Lines 4681 4403 -278
Branches 447 400 -47
==========================================
- Hits 4389 4123 -266
+ Misses 292 280 -12
Glow.transform(TRANSFORMER_NAME, vcfDf, options).show()
}
assert(ex.getCause.isInstanceOf[RuntimeException])
assert(ex.getCause.getMessage.contains("is not true"))
What does the error message look like here?
It looks like this: `java.lang.RuntimeException: '(size(array_repeat(0.0, input[4, int, true]), true) = 17190)' is not true!`
Hm, that's an inscrutable error message. If we give the column an informative name, does the alias appear in the error message?
The column is already called `values`; maybe I can re-throw with a more helpful message.
Actually, I'm not sure if we can re-throw in this case as the issue doesn't manifest until the DataFrame is materialized.
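To illustrate the point: the check embedded in the query plan only fires inside a task, so the exception surfaces (wrapped in a SparkException) once an action runs. A minimal sketch of what re-throwing around the action might look like; the helper name and message here are illustrative, not from the PR:

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.{DataFrame, Row}

// Illustrative sketch only: the assertion raises a RuntimeException inside a
// task, so a clearer message can only be attached where the DataFrame is
// materialized, not where the column is defined.
def collectWithClearerError(df: DataFrame): Array[Row] =
  try {
    df.collect()
  } catch {
    case e: SparkException if e.getCause.isInstanceOf[RuntimeException] =>
      throw new RuntimeException(
        "Number of values is inconsistent across rows", e.getCause)
  }
```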
henrydavidge left a comment
Sorry, had a few more comments.
* @param errMsg Error message if condition fails
* @return Null if true, or throws an exception if not true
*/
def assert_true_or_error(condition: Column, errMsg: String): Column = withExpr {
Hm, actually I think it might be better to exclude this from the Python / Scala APIs. I think the use case is pretty narrow, and since this will be in the next release of Spark, I don't think there's much upside to exposing it in our public API.
You can set the exclude_python option in the YAML file. We should add an analogous exclude_scala flag. I can take care of that if you don't have time.
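For context, the semantics under discussion can be sketched with the assert-style primitive that landed in Spark 3.1 (`raise_error`); this is an illustration of the behavior, not the PR's actual implementation:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, raise_error, when}

// Sketch of the intended behavior: yields null when `condition` holds;
// otherwise the query fails at materialization time with a
// RuntimeException carrying `errMsg`.
def assertTrueOrError(condition: Column, errMsg: String): Column =
  when(condition, lit(null)).otherwise(raise_error(lit(errMsg)))
```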
isnull(
  assert_true_or_error(
    size(col("values")) === expectedNumValues,
    "Number of values is inconsistent!")))
How about "At least one row has an inconsistent number of values (expected x). Please verify that each row contains the same number of values."
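With the expected count interpolated, the call site might read as follows (a sketch of the suggested wording, assuming `expectedNumValues` is in scope as in the snippet above):

```scala
isnull(
  assert_true_or_error(
    size(col("values")) === expectedNumValues,
    s"At least one row has an inconsistent number of values (expected $expectedNumValues). " +
      "Please verify that each row contains the same number of values."))
```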
henrydavidge left a comment
Nice! LGTM
What changes are proposed in this pull request?
This PR improves GloWGR's scaling behavior. It addresses broadcast-join failures like the following:
Caused by: org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
transform_loco is used for the per-chromosome predictions.
Arrow limits each vector to 2,147,483,647 bytes (Int.MaxValue). Vectors are allocated in sizes that are powers of two; the largest power of two below this limit is 1,073,741,824. Each vector contains a data buffer and a validity buffer. The size of the data buffer of 8-byte floats is (# elements * 8) and the size of the validity buffer is ((# elements + 63) >> 6) << 3. Therefore, the maximum number of elements is 132,152,839.
The number of elements in each sample block/label pair is given by: (# alphas) * (# SNPs / # SNPs per block) * (# samples / # sample blocks). The floats are stored in an array of length (# samples / # sample blocks) in each row, and there are (# SNPs / # SNPs per block) * (# alphas) rows. A worked check of this arithmetic is sketched below.
How is this patch tested?
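As referenced in the description above, here is a quick check of the buffer arithmetic (assumptions: a 2^30-byte power-of-two allocation, 8 bytes per element in the data buffer, and a one-bit-per-element validity bitmap padded to 8-byte words; helper names are illustrative):

```scala
// Largest power-of-two allocation below Int.MaxValue bytes: 2^30.
val maxAllocation: Long = 1L << 30 // 1,073,741,824

// Combined size of the data buffer (8 bytes per element) and the
// validity bitmap (one bit per element, padded to 8-byte words).
def bufferBytes(numElements: Long): Long =
  numElements * 8 + (((numElements + 63) >> 6) << 3)

// 132,152,839 elements fill the allocation exactly; one more overflows it.
assert(bufferBytes(132152839L) == maxAllocation)
assert(bufferBytes(132152840L) > maxAllocation)
```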