Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query/ptypes] remove PType.canonical from TableJoin #8436

Merged
merged 13 commits into from Apr 6, 2020

Conversation

akotlar
Copy link
Contributor

@akotlar akotlar commented Apr 2, 2020

No description provided.

@akotlar
Copy link
Contributor Author

akotlar commented Apr 2, 2020

failing tests (outer join) due to requiredeness

@akotlar akotlar changed the title [query/ptypess] remove canonical from tablejoin [query/ptypess] remove PType.canonical from TableJoin Apr 2, 2020
val leftValueType = leftRVDType.valueType
val rightValueType = rightRVDType.valueType

val newRowPType = leftKeyType ++ leftValueType ++ rightValueType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is correct for inner join, but not left, right, or outer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one sec, changing to match to ensure always exhaustive

var rightValueType = rightRVDType.valueType

joinType match {
case "inner" => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no curly brace necessary in a match clause

@@ -622,7 +623,31 @@ case class TableJoin(left: TableIR, right: TableIR, joinType: String, joinKey: I
val rightKeyFieldIdx = rightRVDType.kFieldIdx
val leftValueFieldIdx = leftRVDType.valueFieldIdx
val rightValueFieldIdx = rightRVDType.valueFieldIdx
val newRowPType = PType.canonical(newRowType).asInstanceOf[PStruct]

val leftKeyType = leftRVDType.kType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the requiredness of key fields depends on the join type too. Suppose the left key is required, right key is optional -- an inner join should return required, an outer join should return optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't sure about this, because in the relational db setting 1) the notion of a primary key is that it is always present, it is the unique identifier of the row 2) one joins on keys and returns values. So values can be missing as a consequence of the kind of join, but not keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing and duplicate keys are totally supported in Hail, yep. An outer join of a table with a required key and a table with an optional key can totally have missing keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should use hail more


val leftKeyType = leftRVDType.kType
var leftValueType = leftRVDType.valueType
var rightValueType = rightRVDType.valueType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var + mutation is a less clear pattern than having the match return a triple of (key fields, left fields, right fields). These can all have type Array[(String, PType)]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, we don't need the setFieldsRequiredness method on PCanonicalStruct, it adds complexity without good reason

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var + mutation is a less clear pattern than having the match return a triple of (key fields, left fields, right fields). These can all have type Array[(String, PType)]

I disagree. Tried both ways this was more ergonomic.

Copy link
Contributor

@tpoterba tpoterba Apr 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may have saved a couple lines of code, but it also means the reader needs to look in two places to be convinced that a rule for a specific join type is correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may have saved a couple lines of code, but it also means the reader needs to look

It's a standard pattern: initialize variables outside of the block where their values are set.

also, we don't need the setFieldsRequiredness method on PCanonicalStruct

Without having this on PCanonicalStruct we need to cast to PCanonicalStruct, because we don't have a "copy" constructor on ptypes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without having this on PCanonicalStruct we need to cast to PCanonicalStruct

We don't actually need a struct at all! We should get rid of ++ on PStruct entirely, this code should construct a PCanonicalStruct passing in fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah true can do this :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all the abstract methods on PStruct that modify the type need to go. update/appendKey, ++, typeAfterSelect, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wouldn’t want to modify the rvd row type, just subset it, but if we don’t mind that the strict will become canonical in the future when we have non-canonical structs, then your suggestion works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed assuming your opinion doesn't change

val lValueTypeFields = castFieldRequiredeness(leftRVDType.valueType, false)
(leftRVDType.kType.fields, lValueTypeFields, rightRVDType.valueType.fields)
case "outer" | "zip" =>
val keyTypeFields = castFieldRequiredeness(leftRVDType.kType, false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can actually be the union of requiredeness of keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not can, must! All of the join strategies need to do a full union (InferPType.deepUnionIForgetTheName) on the key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(they all do, but you can do a setRequired to the left in left join, right in right join, && in inner)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(they all do, but you can do a setRequired to the left in left join, right in right join, && in inner)

Yes, this is correct. Left needs the left type, right should have the same field names but the right's ptype, and outer/zip get the union.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test. We did not have one that could correctly identify requiredeness differences between the two tables during left/right

@tpoterba tpoterba changed the title [query/ptypess] remove PType.canonical from TableJoin [query/ptypes] remove PType.canonical from TableJoin Apr 5, 2020
pfs.map(pf => (pf.name, pf.typ))

def castFieldRequiredeness(ps: PStruct, required: Boolean): IndexedSeq[PField] = {
if(ps.fields.forall(_.typ.required == required)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This early return is not worth it. The extra copy won't affect performance, and this is an extra way for things to go wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, you can just have this return IndexedSeq[(String, PType)] and then you don't need noIndex.

val lValueTypeFields = castFieldRequiredeness(leftRVDType.valueType, false)
(leftRVDType.kType.fields, lValueTypeFields, rightRVDType.valueType.fields)
case "outer" | "zip" =>
val keyTypeFields = castFieldRequiredeness(leftRVDType.kType, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not can, must! All of the join strategies need to do a full union (InferPType.deepUnionIForgetTheName) on the key.

val lValueTypeFields = castFieldRequiredeness(leftRVDType.valueType, false)
(leftRVDType.kType.fields, lValueTypeFields, rightRVDType.valueType.fields)
case "outer" | "zip" =>
val keyTypeFields = castFieldRequiredeness(leftRVDType.kType, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(they all do, but you can do a setRequired to the left in left join, right in right join, && in inner)

@@ -991,6 +991,40 @@ def row(new_key, key2, idx1, idx2):
self.assertEqual(inner_join.collect(), inner_join_expected)
self.assertEqual(outer_join.collect(), outer_join_expected)

def test_joins_one_null(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed to verify that key requiredeness handled correctly in left/right join

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! just about there. Thanks for adding tests.

@@ -613,6 +614,25 @@ case class TableJoin(left: TableIR, right: TableIR, joinType: String, joinKey: I
joinKey)
}

def noIndex(pfs: IndexedSeq[PField]): IndexedSeq[(String, PType)] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move these definitions inside of execute, so they're not visible to consumers of the TableAggregate class.

val rValueTypeFields = castFieldRequiredeness(rightRVDType.valueType, false)
(noIndex(leftRVDType.kType.fields), noIndex(leftRVDType.valueType.fields), rValueTypeFields)
case "right" =>
val keyTypeFields = castFieldToPType(leftRVDType.kType, rightRVDType.kType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

castFieldToPType has a confusing name. Probably clearer to do something like the following inline:

leftRVDType.key.zip(rightRVDType.kType.types)

@tpoterba
Copy link
Contributor

tpoterba commented Apr 6, 2020

Nice work. Let's block release on this, clearly it was bugged before.

@danking danking merged commit 4484e6c into hail-is:master Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants