[bugfix] check key ordering when isSorted = true #6267

Merged
merged 4 commits into from Jun 7, 2019

4 participants
@catoverdrive
Collaborator

commented Jun 5, 2019

This was causing out-of-order keys to exist and be used in #6223. This doesn't fix that issue specifically, but it will now throw the correct error (and prevent incorrect tables/matrix tables from being written out).
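
For readers following along, the kind of check described here can be sketched as below. This is a minimal illustration, not the code in this PR: `KeyOrderCheck` and `checkSorted` are hypothetical names, and keys are modelled as any type with an `Ordering`. The idea is that an `isSorted = true` claim is verified as the keys stream past, and the first out-of-order pair raises an error that names both keys.

```scala
// Hypothetical helper, not Hail's implementation: verify that keys really are
// in non-decreasing order instead of trusting an isSorted = true assertion.
object KeyOrderCheck {
  // Wraps the iterator; the check runs lazily as elements are consumed.
  def checkSorted[K](keys: Iterator[K])(implicit ord: Ordering[K]): Iterator[K] = {
    var prev: Option[K] = None
    keys.map { k =>
      prev.foreach { p =>
        // Equal adjacent keys are allowed; only a strict decrease is an error.
        if (ord.gt(p, k))
          throw new IllegalStateException(
            s"keys out of order:\n  Current key:  $k\n  Previous key: $p")
      }
      prev = Some(k)
      k
    }
  }
}
```

Because the wrapper is lazy, it adds one comparison per element and no extra pass; the error only surfaces when the offending element is actually read.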

@catoverdrive catoverdrive force-pushed the catoverdrive:ldprune-mismatch branch from f5082d6 to 7ad53cb Jun 5, 2019

@tpoterba

Collaborator

commented Jun 5, 2019

oh, crap - the lowering rule for MatrixEntriesTable has the same problem:

E             Current key:  [1:10000,[A,G],sample_001]
E             Previous key: [1:10000,[A,G],sample_500]
E           This error can occur after a split_multi if the dataset
E           contains both multiallelic variants and duplicated loci.
@tpoterba

Collaborator

commented Jun 6, 2019

I'm fixing that on this PR.

un-approve because I committed

@tpoterba

Collaborator

commented Jun 6, 2019

@catoverdrive I pushed a fix on here. I'll let Patrick review I think, unless you want to review my piece.

Unfortunately, this kills performance: the no-key benchmark (20s) is about 5-6x faster than the keyed benchmark (110s).

.explode(toExplode)
.mapRows('row.dropFields(toExplode).insertStruct('row (toExplode)))
.mapGlobals('global.dropFields(colsField, oldColIdx))
.keyBy(child.typ.rowKey ++ child.typ.colKey, isSorted = !(child.typ.rowKey.isEmpty && child.typ.colKey.nonEmpty))

@patrick-schultz

patrick-schultz Jun 6, 2019

Collaborator

I think you've ensured that this is always sorted. However, if the row key is empty (as uncommon as that might be), we don't want to be doing a collect-by-key (i.e., a collect). In that case I think we just have to shuffle.

@tpoterba

tpoterba Jun 6, 2019

Collaborator

ah, yes, good point.

@tpoterba

tpoterba Jun 6, 2019

Collaborator

ok, added that logic and fixed the field-order problem I had separately.
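
As a rough illustration of the decision discussed in this thread, here is one reading of the `isSorted` condition in the snippet under review. It is only a sketch: `EntriesKeyPlan`, `Strategy`, and `plan` are hypothetical names, and key fields are modelled as plain strings rather than Hail types. The assumption is that the exploded rows can be treated as already sorted unless the row key is empty while the column key is not, in which case a shuffle is needed.

```scala
// Hypothetical sketch of the keying decision; not the code merged in this PR.
object EntriesKeyPlan {
  sealed trait Strategy
  case object AssumeSorted extends Strategy // keyBy(..., isSorted = true)
  case object Shuffle extends Strategy      // keyBy(..., isSorted = false)

  // Mirrors !(rowKey.isEmpty && colKey.nonEmpty) from the reviewed snippet.
  def plan(rowKey: IndexedSeq[String], colKey: IndexedSeq[String]): Strategy =
    if (rowKey.isEmpty && colKey.nonEmpty) Shuffle else AssumeSorted
}
```

With a non-empty row key such as `locus, alleles`, this picks `AssumeSorted`; with an empty row key and a column key such as `s`, it picks `Shuffle`.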

@danking danking merged commit 8675e0a into hail-is:master Jun 7, 2019

1 check passed

ci-test success