Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail][feature] Super array expression #5913

Merged
merged 19 commits into from May 9, 2019

Conversation

Projects
None yet
4 participants
@danking
Copy link
Collaborator

commented Apr 18, 2019

First step towards treating CollectionExpressions exactly like non-distributed tables. Works for arrays and sets.

cc: @konradjk

>>> a = [hl.struct(b=[hl.struct(inner=1),
...                   hl.struct(inner=2)]),
...      hl.struct(b=[hl.struct(inner=3)])]
>>> hl.eval(a.b)
[[hl.struct(inner=1), hl.struct(inner=2)], [hl.struct(inner=3)]]
>>> hl.eval(a.b.inner)
[[1, 2], [3]]
>>> hl.eval(hl.flatten(a.b).inner)
[1, 2, 3]
>>> hl.eval(hl.flatten(a.b.inner))
[1, 2, 3]

@danking danking changed the title Super array expression [hail][feature] Super array expression Apr 18, 2019

if "Cannot use 'collection.foo' or" not in str(err):
raise
raise TypeError(
f"Cannot use 'collection.foo' or 'collection['foo']' on "

This comment has been minimized.

Copy link
@konradjk

konradjk Apr 18, 2019

Collaborator

Do you actually want foo in here? Could you use item, or does the recursion mess with this?

This comment has been minimized.

Copy link
@danking

danking Apr 18, 2019

Author Collaborator

oh sure yeah item

This comment has been minimized.

Copy link
@danking

danking Apr 18, 2019

Author Collaborator

done

@patrick-schultz

This comment has been minimized.

Copy link
Collaborator

commented Apr 19, 2019

Should we support array['field name'] for when field name isn't a valid python identifier? We'd have to disambiguate that from array indexing, but one is always a python string and one always (convertible to) a numeric expression, right?

@danking

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 19, 2019

@patrick-schultz I am one step ahead of you sir:

https://github.com/hail-is/hail/pull/5913/files#diff-8a90eaa51bf133eaef5d7f12c0d4decfR2736

If you pass a python string, I use the array projection approach, if you pass anything else I try to convert it to an integer.

@danking

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 19, 2019

Every time I see this PR I get giddy. This is so in line with the Hail I want.

etype = self.dtype.element_type
if isinstance(etype, hl.tstruct):
return hl.struct(
**{k: self.map(lambda x: x[k]) for k in etype}).__getitem__(item)

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 19, 2019

Collaborator

I don't understand this bit. Shouldn't this be map(lambda x: x[item])?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 19, 2019

Collaborator

You also can't use hl.struct here -- that can change missingness patterns

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 19, 2019

Collaborator

Design comment - We use specialized expression classes for expression types that have unique behavior. Shouldn't we use that model here? A NestedStructCollectionExpression type of thing?

This comment has been minimized.

Copy link
@danking

danking Apr 22, 2019

Author Collaborator

This converts from the row format to the column format, then it performs the get item. I suppose it can be directly written as:

return self.map(lambda x: x[item])

Conceptually though, I think of myself as operating on the column formatted data. I'll change to self.map.

**{k: self.map(lambda x: x[k]) for k in etype}).__getitem__(item)
try:
if isinstance(etype, (hl.tarray, hl.tset)):
return self.map(lambda x: x.__getitem__(item))

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 19, 2019

Collaborator

using __getitem__ in this way is bad style -- use x[item]

addressed

@tpoterba
Copy link
Collaborator

left a comment

few last comments

if isinstance(type, tndarray) and is_numeric(type.element_type):
return NDArrayNumericExpression(ir, type, indices, aggregations)
elif type in scalars:
if type in scalars:

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 23, 2019

Collaborator

why this change? it was clearer before, no?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 23, 2019

Collaborator

oh, for consistency. I'd prefer to change the others to elif instead

This comment has been minimized.

Copy link
@danking

danking Apr 23, 2019

Author Collaborator

pylint doesn't like the else elifs if not necessary. at first I hated it but now I prefer it.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 23, 2019

Collaborator

do you have a PEP reference?

This comment has been minimized.

Copy link
@danking

danking Apr 25, 2019

Author Collaborator

It's not in PEP-8, which quietly demonstrates both styles in an unrelated recommendation. In pylint it's called no-else-return. We can add this to our standard pylint rules if you prefer this style. At first I hated it. Now I've come to prefer the uniformity of if and the lack of indentation on the else case.

This comment has been minimized.

Copy link
@danking

danking Apr 25, 2019

Author Collaborator

I suppose if all of hail is written using the else-return style we should stay consistent.

This comment has been minimized.

Copy link
@patrick-schultz

patrick-schultz Apr 25, 2019

Collaborator

For what it's worth, I think standard C++ style is also to never put an else clause on an if statement that returns.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 25, 2019

Collaborator

I'm happy with either, but do have a preference for elif in this case. I'm fine with going with style guidelines, though.

I think I have somewhat mixed preferences, too. I'd prefer:

def foo(*args):
    if take_other_path:
        return blah(*args)
    ...
    lots of code
    ...
    return other_thing

to

def foo(*args):
    if take_other_path:
        return blah(*args)
    else:
        lots of code
        ...
        return other_thing

However, in code with several short conditionals, I prefer

def foo(a, b, c, ..., z):
    if a < 0:
        return a
    elif b < 0:
        return b
    ...
    else:
        return z

to

def foo(a, b, c, ..., z):
    if a < 0:
        return a
    if b < 0:
        return b
    ...
    return z

, since if you forget one of the "return" keywords, you get a piece of code that's much harder to debug.

@@ -420,12 +420,13 @@ def __getitem__(self, item):
"""
if isinstance(item, slice):
return self._slice(self.dtype, item.start, item.stop, item.step)
else:

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 23, 2019

Collaborator

can revert this change, right?

This comment has been minimized.

Copy link
@danking

danking Apr 23, 2019

Author Collaborator

yeah but python style prefers no else or elif when each branch returns and it felt weird to move the else to my new if.

"""

if isinstance(item, str):
return self.map(lambda x: x[item])

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 23, 2019

Collaborator

nice, super clean.

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Apr 23, 2019

this is super

@tpoterba
Copy link
Collaborator

left a comment

request clarification on PEP style w.r.t. returns

@danking danking force-pushed the danking:super-array-expression branch from a3e5540 to f8e9624 Apr 25, 2019

@danking

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 25, 2019

@tpoterba @patrick-schultz Ok, I switched to elif style and made fixes so that the surrounding code also followed the elif style. I'd like to merge this and push off any further discussions to another PR. IMO, the no-else-return style is almost always nicer when I want to:

  • bind a variable half way through an if chain, or
  • have a complex condition (see the one with a while loop around 3400) that demands a nested if (I have to duplicate the common fall-through case)
if isinstance(etype, hl.tstruct):
return ArrayStructExpression(ir, type, indices, aggregations)
else:
raise NotImplementedError(type)

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 26, 2019

Collaborator

these not implemented errors are wrong. You want these to get to line 3440 below to go through typ_to_expr.

with this in mind, the no-elif model is way nicer since you don't need to duplicate that logic!

@danking danking force-pushed the danking:super-array-expression branch 2 times, most recently from e52474d to 5928e94 May 6, 2019

@danking danking force-pushed the danking:super-array-expression branch from 589b14b to d15c85a May 9, 2019

@danking danking merged commit 256109d into hail-is:master May 9, 2019

1 check passed

ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.