Conversation

@karenfeng (Collaborator) commented Aug 17, 2020

What changes are proposed in this pull request?

  • Pushes validation that each row of variants contains the same number of genotype values into the block transformer, where it is evaluated lazily.
  • Disables automatic broadcast merges that are likely to OOM, which otherwise fail with errors such as: Caused by: org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
  • Adds logging during transform_loco for the per-chromosome predictions.
  • Adds documentation advising against performing multiallelic splitting in the same query as matrix blocking; related to #280 (Document disabling whole-stage codegen during GloWGR data prep).
  • Adds documentation regarding scalability limits. PyArrow indexes its elements with integers; as a result, the size of a vector buffer is limited to the maximum value of an integer (2,147,483,647). Vectors are allocated in sizes that are powers of two; the largest power of two below this limit is 1,073,741,824. Each vector contains a data buffer and a validity buffer. For 8-byte floats, the size of the data buffer is (# elements * 8) and the size of the validity buffer is ((# elements + 63) >> 6) << 3. Therefore, the maximum number of elements is 132,152,839. The number of elements in each sample block/label pair is determined by the following equation: (# alphas) * (# SNPs / # SNPs per block) * (# samples / # sample blocks). The floats are stored in an array of length (# samples / # sample blocks) in each row, and there are (# SNPs / # SNPs per block) * (# alphas) rows.
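The buffer arithmetic above can be checked with a short Python sketch (pure arithmetic, not actual Glow or PyArrow code):

```python
MAX_BUFFER = 1 << 30  # largest power-of-two allocation below the int32 limit (2,147,483,647)

def buffer_bytes(n_elements: int) -> int:
    """Bytes needed for a float64 Arrow vector: data buffer plus validity bitmap."""
    data = n_elements * 8                     # 8 bytes per float
    validity = ((n_elements + 63) >> 6) << 3  # 1 bit per element, padded to 64-bit words
    return data + validity

# Binary search for the largest element count whose buffers fit in MAX_BUFFER.
lo, hi = 0, MAX_BUFFER
while lo < hi:
    mid = (lo + hi + 1) // 2
    if buffer_bytes(mid) <= MAX_BUFFER:
        lo = mid
    else:
        hi = mid - 1

print(f"{lo:,}")  # 132,152,839
```

At exactly 132,152,839 elements the two buffers sum to precisely 1,073,741,824 bytes; one more element overflows the limit.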

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

codecov bot commented Aug 18, 2020

Codecov Report

Merging #282 into master will decrease coverage by 0.12%.
The diff coverage is 96.15%.


@@            Coverage Diff             @@
##           master     #282      +/-   ##
==========================================
- Coverage   93.76%   93.64%   -0.13%     
==========================================
  Files          94       92       -2     
  Lines        4681     4403     -278     
  Branches      447      400      -47     
==========================================
- Hits         4389     4123     -266     
+ Misses        292      280      -12     
Impacted Files Coverage Δ
...o/projectglow/sql/expressions/VariantQcExprs.scala 88.12% <95.45%> (+1.16%) ⬆️
...ckvariantsandsamples/VariantSampleBlockMaker.scala 100.00% <100.00%> (ø)
...in/scala/io/projectglow/vcf/TabixIndexHelper.scala 83.85% <0.00%> (-0.49%) ⬇️
.../main/scala/io/projectglow/vcf/VCFFileFormat.scala 97.33% <0.00%> (-0.17%) ⬇️
...low/vcf/InternalRowToVariantContextConverter.scala 93.08% <0.00%> (-0.10%) ⬇️
...low/vcf/VariantContextToInternalRowConverter.scala 96.50% <0.00%> (-0.02%) ⬇️
...rojectglow/vcf/VCFLineToInternalRowConverter.scala
...e/src/main/scala/io/projectglow/sql/GlowConf.scala

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e18444f...1d299f0.

@karenfeng karenfeng requested a review from henrydavidge August 28, 2020 22:56
@karenfeng karenfeng changed the title [WIP] GloWGR scaling improvements GloWGR scaling improvements Aug 28, 2020
Glow.transform(TRANSFORMER_NAME, vcfDf, options).show()
}
assert(ex.getCause.isInstanceOf[RuntimeException])
assert(ex.getCause.getMessage.contains("is not true"))
Contributor:

What does the error message look like here?

Collaborator Author:

It looks like this: java.lang.RuntimeException: '(size(array_repeat(0.0, input[4, int, true]), true) = 17190)' is not true!

Contributor:

Hm, that's an inscrutable error message. If we give the column an informative name, does the alias appear in the error message?

Collaborator Author:

The column is already called values; maybe I can re-throw with a more helpful message.

Collaborator Author:

Actually, I'm not sure if we can re-throw in this case as the issue doesn't manifest until the DataFrame is materialized.
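A minimal Python analogy of the situation described above (a generator stands in for a lazy DataFrame; this is not the actual Spark code) shows why the failure cannot be intercepted when the query is built:

```python
def validated(rows, expected_num_values):
    # Like a Spark plan, this generator does no work until it is consumed.
    for row in rows:
        if len(row) != expected_num_values:
            raise RuntimeError(
                f"At least one row has an inconsistent number of values "
                f"(expected {expected_num_values}).")
        yield row

rows = validated([[1.0, 2.0], [3.0]], expected_num_values=2)  # no error raised here
first = next(rows)  # returns [1.0, 2.0]; the bad row has not been reached yet
try:
    next(rows)      # "materializing" the second row finally triggers the check
except RuntimeError as e:
    print(e)
```

Wrapping the construction of `rows` in a try/except catches nothing; the exception only surfaces at consumption time, which is why a re-throw at query-construction time is not possible.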

@henrydavidge (Contributor) left a comment:

Sorry, had a few more comments.

* @param errMsg Error message if condition fails
* @return Null if true, or throws an exception if not true
*/
def assert_true_or_error(condition: Column, errMsg: String): Column = withExpr {
Contributor:

Hm, actually I think it might be better to exclude this from the Python / Scala APIs. I think the use case is pretty narrow, and since this will be in the next release of Spark, I don't think there's much upside to exposing it in our public API.

You can set the exclude_python option in the YAML file. We should add an analogous exclude_scala flag. I can take care of that if you don't have time.

isnull(
assert_true_or_error(
size(col("values")) === expectedNumValues,
"Number of values is inconsistent!")))
Contributor:

How about "At least one row has an inconsistent number of values (expected x). Please verify that each row contains the same number of values."

@henrydavidge (Contributor) left a comment:

Nice! LGTM

@karenfeng karenfeng merged commit 5a6c895 into projectglow:master Sep 22, 2020
@karenfeng karenfeng deleted the glowgr-cleanup branch September 22, 2020 17:48