[bugfix] check key ordering when isSorted = true #6267

Merged
danking merged 4 commits into hail-is:master from catoverdrive:ldprune-mismatch on Jun 7, 2019

@catoverdrive
Contributor

Without this check, out-of-order keys could exist and be used, as seen in #6223. This PR doesn't fix that issue specifically, but the correct error is now thrown, which also prevents incorrect tables and matrix tables from being written out.
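For readers following along, the kind of check described here can be sketched as below. This is a minimal illustration and not Hail's implementation: `SortedCheck` and `checkSorted` are hypothetical names, and the sketch assumes keys can be compared with an ordinary Scala `Ordering`. It wraps an iterator and fails as soon as a key is smaller than the one before it, instead of trusting the caller's claim that the data is sorted.

```scala
// Hypothetical helper for illustration only; not part of Hail.
object SortedCheck {
  // Lazily passes keys through, throwing on the first key that is
  // strictly smaller than its predecessor. Equal keys are allowed.
  def checkSorted[K](keys: Iterator[K])(implicit ord: Ordering[K]): Iterator[K] = {
    var prev: Option[K] = None
    keys.map { k =>
      prev.foreach { p =>
        if (ord.gt(p, k))
          throw new IllegalStateException(
            s"keys out of order:\n  Current key:  $k\n  Previous key: $p")
      }
      prev = Some(k)
      k
    }
  }
}
```

Because the wrapper is lazy, the error only surfaces when the data is actually consumed, for example at write time.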

tpoterba previously approved these changes Jun 5, 2019
@tpoterba
Contributor

tpoterba commented Jun 5, 2019

oh, crap - the lowering rule for MatrixEntriesTable has the same problem:

E             Current key:  [1:10000,[A,G],sample_001]
E             Previous key: [1:10000,[A,G],sample_500]
E           This error can occur after a split_multi if the dataset
E           contains both multiallelic variants and duplicated loci.

@tpoterba
Contributor

tpoterba commented Jun 6, 2019

I'm fixing that on this PR.

@tpoterba dismissed their stale review June 6, 2019 17:11

un-approve because I committed

@tpoterba
Contributor

tpoterba commented Jun 6, 2019

@catoverdrive I pushed a fix on here. I'll let Patrick review I think, unless you want to review my piece.

Unfortunately, this kills performance. The no-key benchmark (20s) is about 5-6x faster than the keyed benchmark (110s).

.explode(toExplode)
.mapRows('row.dropFields(toExplode).insertStruct('row (toExplode)))
.mapGlobals('global.dropFields(colsField, oldColIdx))
.keyBy(child.typ.rowKey ++ child.typ.colKey, isSorted = !(child.typ.rowKey.isEmpty && child.typ.colKey.nonEmpty))
Member

I think you've ensured that this is always sorted. However, if the row key is empty (as uncommon as that might be), we don't want to be doing a collect-by-key (i.e., a collect). In that case I think we just have to shuffle.

Contributor

ah, yes, good point.

Contributor

ok, added that logic and fixed the field-order problem I had separately.
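To make the condition in the quoted `keyBy` line easier to read, here is a small standalone sketch of the same predicate. `KeyInfo`, `EntriesKey`, and `canAssumeSorted` are hypothetical names invented for this illustration; only the boolean expression itself is taken from the diff line above. The reading assumed here is that the result can be treated as already sorted in every case except an empty row key combined with a non-empty column key, where a shuffle is needed instead.

```scala
// Hypothetical stand-in for the row and column key fields of a matrix table type.
final case class KeyInfo(rowKey: IndexedSeq[String], colKey: IndexedSeq[String])

object EntriesKey {
  // Same boolean expression as the isSorted argument in the quoted diff line:
  // false only when the row key is empty and the column key is not.
  def canAssumeSorted(k: KeyInfo): Boolean =
    !(k.rowKey.isEmpty && k.colKey.nonEmpty)
}
```

With a row key such as `locus, alleles` and a column key `s`, the predicate is true; with no row key and a column key `s`, it is false and the key must be established by shuffling.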

@danking merged commit 8675e0a into hail-is:master Jun 7, 2019

4 participants