Allow 0 partitions in big VCF/BGEN writers and pipe transformer #164

Merged
karenfeng merged 6 commits into projectglow:master from fix-empty-df-behavior on Feb 24, 2020

@karenfeng karenfeng commented Feb 24, 2020

What changes are proposed in this pull request?

The following tools, which act on RDD partitions, currently behave inconsistently when the input has 0 partitions:

  • Big BGEN writer
  • Big VCF writer
  • Pipe transformer

This change replaces the underlying 0-row, 0-partition RDD with a 0-row, 1-partition RDD, so users can write a big VCF/BGEN file, or pipe, with the header alone. It also un-breaks our tests under SPARK-30780, which causes empty DataFrames to (usually) be backed by an empty RDD with 0 partitions.
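
For readers unfamiliar with why the partition count matters, here is a minimal, Spark-free sketch of the idea. It is a toy model and not Glow's actual code: the names `Partitions`, `writeBigFile`, and `ensureAtLeastOnePartition` are hypothetical, and the assumption that the header is emitted by the first partition is made only for illustration.

```scala
// Toy model (assumption, not Glow's implementation): a partitioned dataset is
// a sequence of partitions, each holding zero or more already-rendered rows.
object ZeroPartitionSketch {
  type Partitions = Seq[Seq[String]]

  // Hypothetical writer: every partition renders its own chunk, and only the
  // partition at index 0 prepends the header. The chunks are then concatenated.
  def writeBigFile(header: String, partitions: Partitions): String =
    partitions.zipWithIndex.map { case (rows, idx) =>
      val body = rows.map(_ + "\n").mkString
      if (idx == 0) header + "\n" + body else body
    }.mkString

  // The substitution described above: a 0-partition input becomes a single
  // empty partition, so per-partition logic still runs exactly once.
  def ensureAtLeastOnePartition(partitions: Partitions): Partitions =
    if (partitions.isEmpty) Seq(Seq.empty[String]) else partitions
}
```

In this model, `writeBigFile(header, Seq.empty)` returns an empty string because no partition exists to emit the header, while `writeBigFile(header, ensureAtLeastOnePartition(Seq.empty))` returns the header followed by a newline. Inputs that already have partitions pass through unchanged.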

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>

codecov bot commented Feb 24, 2020

Codecov Report

Merging #164 into master will increase coverage by 0.07%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##           master     #164      +/-   ##
==========================================
+ Coverage   92.06%   92.14%   +0.07%     
==========================================
  Files          86       86              
  Lines        4123     4135      +12     
  Branches      381      389       +8     
==========================================
+ Hits         3796     3810      +14     
+ Misses        327      325       -2

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...in/scala/io/projectglow/vcf/BigVCFDatasource.scala | 100% <100%> (ø) | ⬆️ |
| .../scala/io/projectglow/bgen/BigBgenDatasource.scala | 100% <100%> (ø) | ⬆️ |
| ...src/main/scala/org/apache/spark/sql/SQLUtils.scala | 94.11% <100%> (+0.36%) | ⬆️ |
| ...scala/io/projectglow/transformers/pipe/Piper.scala | 94.2% <92.3%> (-1.11%) | ⬇️ |
| ...n/scala/io/projectglow/bgen/BgenRecordWriter.scala | 100% <0%> (+2.25%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3370ad0...6c2e956.

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@karenfeng karenfeng changed the title from "Disallow 0 partitions in big BGEN writer and pipe transformer" to "Allow 0 partitions in big VC/FBGEN writers and pipe transformer" on Feb 24, 2020
@karenfeng karenfeng changed the title from "Allow 0 partitions in big VC/FBGEN writers and pipe transformer" to "Allow 0 partitions in big VCF/BGEN writers and pipe transformer" on Feb 24, 2020
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@karenfeng karenfeng merged commit 4f02f14 into projectglow:master Feb 24, 2020
@karenfeng karenfeng deleted the fix-empty-df-behavior branch February 24, 2020 23:18
kianfar77 pushed a commit to kianfar77/glow that referenced this pull request Feb 28, 2020
…ectglow#164)

* Check for 0 partitions in big BGEN writer and piper

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Scalastyle

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Testing

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Replace empty, 0-partition RDD with empty, 1-partition RDD

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Undo cosmetic changes

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
…ectglow#164)

* Check for 0 partitions in big BGEN writer and piper

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Scalastyle

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Testing

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Replace empty, 0-partition RDD with empty, 1-partition RDD

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Undo cosmetic changes

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Signed-off-by: Henry Davidge <hhd@databricks.com>