Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Python generating Terrible, Horrible, No Good, Very Bad IR #5107

Merged
merged 6 commits into from Jan 22, 2019

Conversation

Projects
None yet
2 participants
@tpoterba
Copy link
Collaborator

commented Jan 10, 2019

Stacked on #5103

Fixes #5100

This PR changes the Python select and key_by operators to generate
the IR we'd expect them to be generating (e.g. ht.select('x') emits
a SelectFields instead of a MakeStruct).

In the process, I found and fixed a bug in group expressions for
GroupedMatrixTable. This is tested for both tables and matrix tables
in the new tests in test_table and test_matrix_table.

Some timings:

>>> mt = hl.read_matrix_table('/Users/tpoterba/data/profile.mt')
>>> %timeit mt.select_entries('GT')._force_count_rows()

master:

1.64 s ± 106 ms per loop

PR:

967 ms ± 61.1 ms per loop
@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 10, 2019

This is probably even more pronounced for BGEN than MT!

@tpoterba tpoterba force-pushed the tpoterba:fix-select-python branch from 0d2d782 to dfd19eb Jan 10, 2019

@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 10, 2019

This is a meaty PR, I'm sorry for that. But I think it'll be pretty hard to break up :(

I can't, for instance, easily fix the bugs in GroupedMatrixTable separately from the infrastructure code.

@tpoterba tpoterba removed the stacked PR label Jan 10, 2019

@danking

This comment has been minimized.

Copy link
Collaborator

commented Jan 10, 2019

Yeah. It seems there’s some clean up (check vs get, same patttern for all the annotate/select methids), then some group stuff that I haven’t unpacked, and I haven’t found what fixed the bug yet.

@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 10, 2019

The bug was fixed by the logic in result that uses UIDs for computed keys, renaming them after aggregation. I'm happy to walk you through this if you want.

@danking
Copy link
Collaborator

left a comment

First pass. There's a lot of changes in matrix table.py that I haven't looked at yet. I'm still not sure where the fix was

h = h * 31 + hash(self.source)
for a in self.axes:
h = h * 31 + hash(a)
return h

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

is it bad if we just use return hash((self.source, *self.axes))?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

yeah, that's probably fine.

@@ -31,24 +41,30 @@ def unify(*indices):
return Indices(src, axes)

@property
def key(self):
def protected_key(self) -> List[str]:
if self._initialized_key:

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

can we just use self._cached_key is None instead? it looks like you've changed get_key to never return None otherwise.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

oh yeah, totally

protected_key = set(select_fields)
insertions = []

final_fields = select_fields[:]

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

you do a few copies of indices.protected_key. In particular, the construction of the set copies the elements no?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

two copies, yes. Maybe Indices.protected_key should return a tuple or something (Which would need to be copied), but the set copy is for faster membership checks, and the copy on line 367 is because we intend to mutate final_fields.

This comment has been minimized.

Copy link
@danking

danking Jan 16, 2019

Collaborator

three copies, select_fields and protected_key are both copies as well, but it looks like you need all three copies.


for e in exprs:
if not e._ir.is_nested_field:
raise ExpressionException("method '{}' expects keyword arguments for complex expressions".format(caller))
raise ExpressionException(f"{caller!r} expects keyword arguments for complex expressions\n"

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

what does !r do?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

same as {repr(caller)}. I discovered it recently.

s = base_struct.select(*select_fields)._annotate_ordered(dict(insertions), final_fields)

assert list(s) == final_fields
return s

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

This method returns a struct instead of a dictionary of keys to expressions. That's why there's changes everywhere that get_select_exprs is used?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

yes, exactly. And it simplifies those locations, too.

assignments = OrderedDict()
select_fields = indices.protected_key[:]
protected_key = set(select_fields)
insertions = []

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

Why not start as an empty dict and assign key value pairs instead of appending?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

ah, yeah, that'll work. I had refactored this a bit w.r.t. name collision checking, and was originally using insertions to check that you didn't try to insert two thing with the same name.

from hail.expr.expressions import to_expr, ExpressionException
check_keys(caller, k, protected_key)
insertions.append((k, e))
final_fields.append(k)

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

do final_fields before insertions.append so it mirrors the above ordering

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

good call

# don't clog the IR with redundant field names
s = base_struct.select(*select_fields).annotate(**dict(insertions))
else:
s = base_struct.select(*select_fields)._annotate_ordered(dict(insertions), final_fields)

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

Can you explain why field insertion order matters when we have nested fields.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

changed it to:

    if final_fields != select_fields + list(insertions):

and removed the select field logic.

The issue comes when you have something like:

select(ht.a, ht.b.c, ht.d, e = 'foo')

We want the result type to be struct{a, c, d, e} but want to emit `select(a, d)._annotate_ordered(dict(c = ht.b.c, e = 'foo'), ['a', 'c', 'd', 'e'])


The change is both simpler and stronger.
protected_key = set(indices.protected_key)
for k in named_exprs:
check_keys(caller, k, protected_key)
check_collisions(caller, list(named_exprs), indices)
return named_exprs

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

The return value is ignored in the uses in this PR. Do we save work by not calling to_expr(v)?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

to_expr is called by the typecheckers, actually (or should be -- I added them in many places). There's nothing to do except check, now! :)

This comment has been minimized.

Copy link
@danking

danking Jan 18, 2019

Collaborator

I mean that check_annotate_exprs returns named_exprs, but every call site of check_annotate_exprs ignores the return value. check_annotate_exprs only looks at the keys of named_exprs, so why bother with the second line of this method:

named_exprs = {k: to_expr(v) for k, v in named_exprs.items()}

mt3 = (mt.group_cols_by(col_idx='100')
.aggregate(x=hl.agg.collect_as_set(mt.col_idx + 5)))
assert mt3.aggregate_entries(hl.agg.all(mt3.x == hl.set({5, 6, 7})))

This comment has been minimized.

Copy link
@danking

danking Jan 11, 2019

Collaborator

why the same test twice?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Jan 15, 2019

Author Collaborator

one for group_rows_by, one for group_cols_by

This comment has been minimized.

Copy link
@danking

danking Jan 18, 2019

Collaborator

👍

@tpoterba tpoterba force-pushed the tpoterba:fix-select-python branch from dfd19eb to 6bf375d Jan 15, 2019

addressed

@danking

This comment has been minimized.

Copy link
Collaborator

commented Jan 17, 2019

looks like some fields (cols) are changing names

@danking danking referenced this pull request Jan 17, 2019

Merged

make ld_prune fast again #5078

protected_key = set(indices.protected_key)
for k in named_exprs:
check_keys(caller, k, protected_key)
check_collisions(caller, list(named_exprs), indices)
return named_exprs

This comment has been minimized.

Copy link
@danking

danking Jan 18, 2019

Collaborator

I mean that check_annotate_exprs returns named_exprs, but every call site of check_annotate_exprs ignores the return value. check_annotate_exprs only looks at the keys of named_exprs, so why bother with the second line of this method:

named_exprs = {k: to_expr(v) for k, v in named_exprs.items()}

mt3 = (mt.group_cols_by(col_idx='100')
.aggregate(x=hl.agg.collect_as_set(mt.col_idx + 5)))
assert mt3.aggregate_entries(hl.agg.all(mt3.x == hl.set({5, 6, 7})))

This comment has been minimized.

Copy link
@danking

danking Jan 18, 2019

Collaborator

👍

tpoterba added some commits Jan 10, 2019

Fixes #5100
Fix Python generating Terrible, Horrible, No Good, Very Bad IR

This PR changes the Python select and key_by operators to generate
the IR we'd expect them to be generating (e.g. `ht.select('x')` emits
a `SelectFields` instead of a `MakeStruct`).

In the process, I found and fixed a bug in group expressions for
`GroupedMatrixTable`. This is tested for both tables and matrix tables
in the new tests in `test_table` and `test_matrix_table`.

Some timings:

    >>> mt = hl.read_matrix_table('/Users/tpoterba/data/profile.mt')
    >>> %timeit mt.select_entries('GT')._force_count_rows()

master:

    1.64 s ± 106 ms per loop

PR:

    967 ms ± 61.1 ms per loop

@tpoterba tpoterba force-pushed the tpoterba:fix-select-python branch from e4c2bcd to ca7df7d Jan 22, 2019

@danking
Copy link
Collaborator

left a comment

Merge, baby, merge

@danking danking merged commit 8c29706 into hail-is:master Jan 22, 2019

1 check passed

hail-ci-0-1 successful build
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.