Conversation

@karenfeng (Collaborator) commented Aug 17, 2020

What changes are proposed in this pull request?

  • Pushes validation that each row of variants contains the same number of genotype values into the block transformer, where it is evaluated lazily.
  • Disables automatic broadcast merges that are likely to OOM, which otherwise fail with errors such as: Caused by: org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
  • Adds logging during transform_loco for the per-chromosome predictions.
  • Adds documentation advising against performing multiallelic splitting in the same query as matrix blocking; related to #280 (Document disabling whole-stage codegen during GloWGR data prep).
  • Adds documentation regarding scalability limits. PyArrow indexes its elements with integers; as a result, the size of a vector buffer is limited to the maximum value of an integer (2,147,483,647). Vectors are allocated in sizes that are powers of two; the largest power of two below this limit is 1,073,741,824. Each vector contains a data buffer and a validity buffer. For 8-byte floats, the size of the data buffer is (# elements * 8) and the size of the validity buffer is ((# elements + 63) >> 6) << 3. Therefore, the maximum number of elements is 132,152,839. The number of elements in each sample block/label pair is determined by the following equation: (# alphas) * (# SNPs / # SNPs per block) * (# samples / # sample blocks). The floats are stored in an array of length (# samples / # sample blocks) in each row, and there are (# SNPs / # SNPs per block) * (# alphas) rows.
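The buffer arithmetic above can be checked with a short Python sketch (pure arithmetic, not actual Glow or PyArrow code):

```python
MAX_BUFFER = 1 << 30  # largest power-of-two allocation below the int32 limit (2,147,483,647)

def buffer_bytes(n_elements: int) -> int:
    """Bytes needed for a float64 Arrow vector: data buffer plus validity bitmap."""
    data = n_elements * 8                     # 8 bytes per float
    validity = ((n_elements + 63) >> 6) << 3  # 1 bit per element, padded to 64-bit words
    return data + validity

# Binary search for the largest element count whose buffers fit in MAX_BUFFER.
lo, hi = 0, MAX_BUFFER
while lo < hi:
    mid = (lo + hi + 1) // 2
    if buffer_bytes(mid) <= MAX_BUFFER:
        lo = mid
    else:
        hi = mid - 1

print(f"{lo:,}")  # 132,152,839
```

At exactly 132,152,839 elements the two buffers sum to precisely 1,073,741,824 bytes; one more element overflows the limit.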

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

codecov bot commented Aug 18, 2020

Codecov Report

Merging #282 into master will decrease coverage by 0.12%.
The diff coverage is 96.15%.


@@            Coverage Diff             @@
##           master     #282      +/-   ##
==========================================
- Coverage   93.76%   93.64%   -0.13%     
==========================================
  Files          94       92       -2     
  Lines        4681     4403     -278     
  Branches      447      400      -47     
==========================================
- Hits         4389     4123     -266     
+ Misses        292      280      -12     
Impacted Files Coverage Δ
...o/projectglow/sql/expressions/VariantQcExprs.scala 88.12% <95.45%> (+1.16%) ⬆️
...ckvariantsandsamples/VariantSampleBlockMaker.scala 100.00% <100.00%> (ø)
...in/scala/io/projectglow/vcf/TabixIndexHelper.scala 83.85% <0.00%> (-0.49%) ⬇️
.../main/scala/io/projectglow/vcf/VCFFileFormat.scala 97.33% <0.00%> (-0.17%) ⬇️
...low/vcf/InternalRowToVariantContextConverter.scala 93.08% <0.00%> (-0.10%) ⬇️
...low/vcf/VariantContextToInternalRowConverter.scala 96.50% <0.00%> (-0.02%) ⬇️
...rojectglow/vcf/VCFLineToInternalRowConverter.scala
...e/src/main/scala/io/projectglow/sql/GlowConf.scala

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e18444f...1d299f0.

@karenfeng karenfeng requested a review from henrydavidge August 28, 2020 22:56
@karenfeng karenfeng changed the title [WIP] GloWGR scaling improvements GloWGR scaling improvements Aug 28, 2020
Glow.transform(TRANSFORMER_NAME, vcfDf, options).show()
}
assert(ex.getCause.isInstanceOf[RuntimeException])
assert(ex.getCause.getMessage.contains("is not true"))
Contributor:

What does the error message look like here?

Collaborator Author:

It looks like this: java.lang.RuntimeException: '(size(array_repeat(0.0, input[4, int, true]), true) = 17190)' is not true!

Contributor:

Hm, that's an inscrutable error message. If we give the column an informative name, does the alias appear in the error message?

Collaborator Author:

The column is already called values; maybe I can re-throw with a more helpful message.

Collaborator Author:

Actually, I'm not sure if we can re-throw in this case as the issue doesn't manifest until the DataFrame is materialized.
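A minimal Python analogy of the situation described above (a generator stands in for a lazy DataFrame; this is not the actual Spark code) shows why the failure cannot be intercepted when the query is built:

```python
def validated(rows, expected_num_values):
    # Like a Spark plan, this generator does no work until it is consumed.
    for row in rows:
        if len(row) != expected_num_values:
            raise RuntimeError(
                f"At least one row has an inconsistent number of values "
                f"(expected {expected_num_values}).")
        yield row

rows = validated([[1.0, 2.0], [3.0]], expected_num_values=2)  # no error raised here
first = next(rows)  # returns [1.0, 2.0]; the bad row has not been reached yet
try:
    next(rows)      # "materializing" the second row finally triggers the check
except RuntimeError as e:
    print(e)
```

Wrapping the construction of `rows` in a try/except catches nothing; the exception only surfaces at consumption time, which is why a re-throw at query-construction time is not possible.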

@henrydavidge (Contributor) left a comment:

Sorry, had a few more comments.

* @param errMsg Error message if condition fails
* @return Null if true, or throws an exception if not true
*/
def assert_true_or_error(condition: Column, errMsg: String): Column = withExpr {
Contributor:

Hm, actually I think it might be better to exclude this from the Python / Scala APIs. I think the use case is pretty narrow, and since this will be in the next release of Spark, I don't think there's much upside to exposing it in our public API.

You can set the exclude_python option in the YAML file. We should add an analogous exclude_scala flag. I can take care of that if you don't have time.

isnull(
assert_true_or_error(
size(col("values")) === expectedNumValues,
"Number of values is inconsistent!")))
Contributor:

How about "At least one row has an inconsistent number of values (expected x). Please verify that each row contains the same number of values."

@henrydavidge (Contributor) left a comment:

Nice! LGTM

@karenfeng karenfeng merged commit 5a6c895 into projectglow:master Sep 22, 2020
@karenfeng karenfeng deleted the glowgr-cleanup branch September 22, 2020 17:48