[bugfix] check key ordering when isSorted = true #6267

Merged
danking merged 4 commits into hail-is:master on Jun 7, 2019

catoverdrive

Key ordering was not being checked when isSorted = true, which allowed out-of-order keys to exist and be used in #6223. This PR doesn't fix that issue specifically, but it will now throw the correct error (and prevent incorrect tables/matrix tables from being written out).
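
As an illustration of the kind of check described here, the following is a minimal sketch, not Hail's actual implementation. It assumes keys can be modeled as any type with an `Ordering`, and that "sorted" means non-decreasing (duplicates allowed). The names `KeyOrderCheck`, `firstViolation`, and `keyBySorted` are hypothetical.

```scala
// Hypothetical sketch, not Hail's implementation: verify a caller's
// "already sorted" claim instead of trusting it.
object KeyOrderCheck {
  // Returns the first adjacent (previous, current) pair that is out of order, if any.
  // Assumption: equal adjacent keys are allowed (non-decreasing order).
  def firstViolation[K](keys: Seq[K])(implicit ord: Ordering[K]): Option[(K, K)] =
    keys.zip(keys.drop(1)).find { case (prev, cur) => ord.gt(prev, cur) }

  // Stands in for a keyBy(..., isSorted = true) call: fail loudly on a bad claim
  // rather than letting an incorrectly ordered dataset be written out.
  def keyBySorted[K: Ordering](keys: Seq[K]): Seq[K] = {
    firstViolation(keys).foreach { case (prev, cur) =>
      throw new IllegalStateException(
        s"keys out of order:\n  Current key:  $cur\n  Previous key: $prev")
    }
    keys
  }
}
```

With this shape, a caller that wrongly asserts sortedness gets an error at the point of the claim, which is the behavior the PR description asks for.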

tpoterba previously approved these changes Jun 5, 2019

tpoterba commented Jun 5, 2019

oh, crap - the lowering rule for MatrixEntriesTable has the same problem:

E             Current key:  [1:10000,[A,G],sample_001]
E             Previous key: [1:10000,[A,G],sample_500]
E           This error can occur after a split_multi if the dataset
E           contains both multiallelic variants and duplicated loci.

tpoterba commented Jun 6, 2019

I'm fixing that on this PR.

tpoterba dismissed their stale review June 6, 2019 17:11

un-approve because I committed

tpoterba commented Jun 6, 2019

@catoverdrive I pushed a fix on here. I'll let Patrick review I think, unless you want to review my piece.

Unfortunately, this kills performance: the no-key benchmark (20s) is about 5-6x faster than the keyed benchmark (110s).

.explode(toExplode)
.mapRows('row.dropFields(toExplode).insertStruct('row (toExplode)))
.mapGlobals('global.dropFields(colsField, oldColIdx))
.keyBy(child.typ.rowKey ++ child.typ.colKey, isSorted = !(child.typ.rowKey.isEmpty && child.typ.colKey.nonEmpty))
Collaborator

I think you've ensured that this is always sorted. However, if the row key is empty (as uncommon as that might be), we don't want to be doing a collect-by-key (i.e., a collect). In that case I think we just have to shuffle.
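
The `isSorted` argument in the `keyBy` call quoted above reduces to a predicate on the two key lists. The sketch below restates it in isolation, with key fields modeled as plain strings (an assumption); `EntriesKeyPlan` and `canClaimSorted` are hypothetical names, not part of Hail.

```scala
// Hypothetical restatement of the isSorted expression from the quoted keyBy call.
object EntriesKeyPlan {
  // Sortedness is not claimed only when the row key is empty while the
  // column key is non-empty; every other combination claims sorted output.
  def canClaimSorted(rowKey: Seq[String], colKey: Seq[String]): Boolean =
    !(rowKey.isEmpty && colKey.nonEmpty)
}
```

Read this way, the empty-row-key case is the one where the result cannot simply be declared sorted, which matches the comment's point that a shuffle is needed there.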

Contributor

ah, yes, good point.

Contributor

ok, added that logic and fixed the field-order problem I had separately.

danking merged commit 8675e0a into hail-is:master Jun 7, 2019