Allow 0 partitions in big VCF/BGEN writers and pipe transformer #164

karenfeng · 2020-02-24T19:30:28Z

What changes are proposed in this pull request?

The following tools, which act on RDD partitions, currently behave inconsistently when the underlying RDD has 0 partitions:

Big BGEN writer
Big VCF writer
Pipe transformer

With this change, we replace an underlying 0-row, 0-partition RDD with a 0-row, 1-partition RDD. This allows users to write a big VCF/BGEN file, or pipe, with the header alone. It also fixes our tests that break under SPARK-30780, which causes empty DataFrames to (usually) be backed by an empty RDD with 0 partitions.
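The replacement described above can be sketched with public Spark APIs only. This is a minimal illustration, not the code in this PR: the object name `EmptyPartitionSketch` and the helper `withAtLeastOnePartition` are hypothetical, and the actual change may operate on internal rows rather than on `Row` objects.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object EmptyPartitionSketch {
  // Hypothetical helper, not Glow's actual implementation.
  // If the DataFrame is backed by an RDD with 0 partitions, rebuild it on top of
  // an empty RDD that has exactly 1 partition, keeping the original schema.
  // Per-partition logic (such as writing a header) then runs exactly once.
  def withAtLeastOnePartition(df: DataFrame): DataFrame = {
    if (df.rdd.getNumPartitions == 0) {
      val spark = df.sparkSession
      // parallelize over an empty Seq with numSlices = 1 yields one empty partition
      val emptyRdd = spark.sparkContext.parallelize(Seq.empty[Row], numSlices = 1)
      spark.createDataFrame(emptyRdd, df.schema)
    } else {
      // DataFrames that already have at least one partition are returned untouched
      df
    }
  }
}
```

A writer or piper that emits the header from the first partition could call such a helper before any `mapPartitions`-style work, so that an empty input still produces a header-only output.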

How is this patch tested?

Unit tests
Integration tests
Manual tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

codecov · 2020-02-24T19:48:35Z

Codecov Report

Merging #164 into master will increase coverage by 0.07%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##           master     #164      +/-   ##
==========================================
+ Coverage   92.06%   92.14%   +0.07%     
==========================================
  Files          86       86              
  Lines        4123     4135      +12     
  Branches      381      389       +8     
==========================================
+ Hits         3796     3810      +14     
+ Misses        327      325       -2

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...in/scala/io/projectglow/vcf/BigVCFDatasource.scala | `100% <100%> (ø)` | ⬆️ |
| .../scala/io/projectglow/bgen/BigBgenDatasource.scala | `100% <100%> (ø)` | ⬆️ |
| ...src/main/scala/org/apache/spark/sql/SQLUtils.scala | `94.11% <100%> (+0.36%)` | ⬆️ |
| ...scala/io/projectglow/transformers/pipe/Piper.scala | `94.2% <92.3%> (-1.11%)` | ⬇️ |
| ...n/scala/io/projectglow/bgen/BgenRecordWriter.scala | `100% <0%> (+2.25%)` | ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3370ad0...6c2e956.


…ectglow#164)

* Check for 0 partitions in big BGEN writer and piper
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Scalastyle
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Testing
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Replace empty, 0-partition RDD with empty, 1-partition RDD
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Clean up
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Undo cosmetic changes
  Signed-off-by: Karen Feng <karen.feng@databricks.com>

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

…ectglow#164)

* Check for 0 partitions in big BGEN writer and piper
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Scalastyle
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Testing
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Replace empty, 0-partition RDD with empty, 1-partition RDD
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Clean up
  Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Undo cosmetic changes
  Signed-off-by: Karen Feng <karen.feng@databricks.com>

Signed-off-by: Henry Davidge <hhd@databricks.com>

karenfeng added 2 commits February 24, 2020 11:27

Check for 0 partitions in big BGEN writer and piper

8bec65e

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Scalastyle

d500634

Signed-off-by: Karen Feng <karen.feng@databricks.com>

karenfeng added 2 commits February 24, 2020 13:36

Testing

1a7f9ee

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Replace empty, 0-partition RDD with empty, 1-partition RDD

74a3a1d

Signed-off-by: Karen Feng <karen.feng@databricks.com>

karenfeng changed the title ~~Disallow 0 partitions in big BGEN writer and pipe transformer~~ Allow 0 partitions in big VC/FBGEN writers and pipe transformer Feb 24, 2020

karenfeng changed the title ~~Allow 0 partitions in big VC/FBGEN writers and pipe transformer~~ Allow 0 partitions in big VCF/BGEN writers and pipe transformer Feb 24, 2020

Clean up

2a5da52

Signed-off-by: Karen Feng <karen.feng@databricks.com>

karenfeng requested a review from fnothaft February 24, 2020 22:28

fnothaft approved these changes Feb 24, 2020


Undo cosmetic changes

6c2e956

Signed-off-by: Karen Feng <karen.feng@databricks.com>

karenfeng merged commit 4f02f14 into projectglow:master Feb 24, 2020

karenfeng deleted the fix-empty-df-behavior branch February 24, 2020 23:18
