[bugfix] check key ordering when isSorted = true #6267

Merged: 4 commits into hail-is:master on Jun 7, 2019

Conversation

@catoverdrive (Contributor) commented Jun 5, 2019

Key ordering was not being checked when isSorted = true, which allowed out-of-order keys to exist and be used in #6223. This PR doesn't fix that issue specifically, but it will now throw the correct error (and prevent incorrect tables/matrix tables from being written out).
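
The kind of check described above can be sketched roughly as follows. This is a minimal illustration, not Hail's actual implementation: `SortedKeyCheck` and `checkSorted` are hypothetical names, keys are modeled as any type with an `Ordering`, and treating equal adjacent keys as valid is an assumption of this sketch.

```scala
// Hypothetical helper for illustration only; not part of Hail.
object SortedKeyCheck {
  // Wraps an iterator of keys and fails on the first key that is strictly
  // smaller than the key before it. Equal adjacent keys pass through
  // (an assumption of this sketch).
  def checkSorted[K](keys: Iterator[K])(implicit ord: Ordering[K]): Iterator[K] =
    new Iterator[K] {
      private var prev: Option[K] = None

      def hasNext: Boolean = keys.hasNext

      def next(): K = {
        val cur = keys.next()
        prev.foreach { p =>
          if (ord.lt(cur, p))
            throw new IllegalStateException(
              s"Keys out of order:\n  Current key:  $cur\n  Previous key: $p")
        }
        prev = Some(cur)
        cur
      }
    }
}
```

Because the check is lazy, it only fires as the data is consumed, so a caller that claims `isSorted = true` on unsorted input would fail at the first offending key rather than silently producing a mis-keyed result.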

@tpoterba (Collaborator) commented Jun 5, 2019

oh, crap - the lowering rule for MatrixEntriesTable has the same problem:

E             Current key:  [1:10000,[A,G],sample_001]
E             Previous key: [1:10000,[A,G],sample_500]
E           This error can occur after a split_multi if the dataset
E           contains both multiallelic variants and duplicated loci.
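
One plausible way an error like this arises can be sketched as follows, under the assumption that the entries table is produced by exploding each row over the columns in their stored order. `ExplodeOrderDemo`, `explode`, and `isSorted` are hypothetical names, and keys are simplified to pairs of strings.

```scala
// Hypothetical illustration; the key layout is an assumption, not Hail's code.
object ExplodeOrderDemo {
  // Pair every row key with every column key, keeping the columns in
  // whatever order they are stored.
  def explode(rowKeys: Seq[String], colKeys: Seq[String]): Seq[(String, String)] =
    for (r <- rowKeys; c <- colKeys) yield (r, c)

  // True when each key is greater than or equal to the one before it.
  def isSorted[K](keys: Seq[K])(implicit ord: Ordering[K]): Boolean =
    keys.zip(keys.drop(1)).forall { case (a, b) => ord.lteq(a, b) }
}
```

With rows sorted by row key but columns stored as `Seq("sample_500", "sample_001")`, `ExplodeOrderDemo.explode` yields `sample_500` before `sample_001` within the same row, so the combined row key plus column key is not sorted even though the rows are.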

@tpoterba (Collaborator) commented Jun 6, 2019

I'm fixing that on this PR.

@tpoterba dismissed their stale review Jun 6, 2019

un-approve because I committed

@tpoterba (Collaborator) commented Jun 6, 2019

@catoverdrive I pushed a fix on here. I'll let Patrick review I think, unless you want to review my piece.

Unfortunately, this kills performance: the no-key benchmark (20s) is about 5-6x faster than the keyed benchmark (110s).

.explode(toExplode)
.mapRows('row.dropFields(toExplode).insertStruct('row (toExplode)))
.mapGlobals('global.dropFields(colsField, oldColIdx))
.keyBy(child.typ.rowKey ++ child.typ.colKey, isSorted = !(child.typ.rowKey.isEmpty && child.typ.colKey.nonEmpty))

@patrick-schultz (Collaborator) commented Jun 6, 2019

I think you've ensured that this is always sorted. However, if the row key is empty (as uncommon as that might be), we don't want to be doing a collect-by-key (i.e., a collect). In that case I think we just have to shuffle.

@tpoterba (Collaborator) commented Jun 6, 2019

ah, yes, good point.

@tpoterba (Collaborator) commented Jun 6, 2019

ok, added that logic and fixed the field-order problem I had separately.
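
The rule discussed in this thread can be sketched as a small decision function. It mirrors the `isSorted = !(rowKey.isEmpty && colKey.nonEmpty)` expression from the diff context; `EntriesKeyPlan`, `Strategy`, and `strategy` are hypothetical names introduced for illustration, and key fields are modeled as plain strings.

```scala
// Hypothetical sketch of the sorted-vs-shuffle decision; not Hail's code.
object EntriesKeyPlan {
  sealed trait Strategy
  case object AlreadySorted extends Strategy // corresponds to isSorted = true
  case object Shuffle extends Strategy       // corresponds to isSorted = false

  // With an empty row key and a non-empty column key, keeping the output
  // sorted would mean collecting everything under a single key, so fall
  // back to a shuffle. In every other case the output is treated as sorted.
  def strategy(rowKey: IndexedSeq[String], colKey: IndexedSeq[String]): Strategy =
    if (rowKey.isEmpty && colKey.nonEmpty) Shuffle else AlreadySorted
}
```

For example, `EntriesKeyPlan.strategy(Vector(), Vector("s"))` selects `Shuffle`, while a table keyed by `Vector("locus", "alleles")` and `Vector("s")` selects `AlreadySorted`.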

@danking merged commit 8675e0a into hail-is:master Jun 7, 2019
1 check passed

4 participants