Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] freeze when necessary to avoid hashing issues #12265

Merged
merged 3 commits into from
Oct 28, 2022

Conversation

danking
Copy link
Contributor

@danking danking commented Oct 4, 2022

CHANGELOG: Fix long-standing bug wherein hl.agg.collect_as_set and hl.agg.counter error when applied to types which, in Python, are unhashable. For example, hl.agg.counter(t.list_of_genes) will not error when t.list_of_genes is a list. Instead, the counter dictionary will use FrozenList keys from the frozenlist package.

Hey @iris-garden ! I figured this was good reviewing practice for you and also a chance to see how we convert data to/from JSON and to/from the JVM (by way of this "encoded" representation which is a binary one). The details of that are not super important to this PR, but you might take a peek to understand the change.

The main issue here is that in Python, you can't write:

{[1]}

Because sets must contain "hashable" data. Python lists are not hashable because they're mutable. This is transitively a problem. For example, the following also fails with the same error because the list inside the tuple is mutable thus the tuple is not (safely) hashable.

{("tuples", "are", "immutable", ["lists", "are", "not"])}

Hail's internal language is fully immutable, so every type can be placed inside a set (or used as the keys of a dict). When we convert from Hail's internal representation to Python, we cannot use mutable types in hashable positions. Unfortunately, we also need to maintain backwards compatibility with the way the code currently works. You can see this pretty clearly in the difference between hl.agg.collect and hl.agg.collect_as_set:

t = hl.utils.range_table(1)
t = t.annotate(ls = [1, 2, 3])
collected_ls = t.aggregate(hl.agg.collect(t.ls))
collected_as_set_ls = t.aggregate(hl.agg.collect_as_set(t.ls))

collected_ls should be [[1, 2, 3]] whereas collected_as_set_ls necessarily uses hashable types: {frozenlist([1, 2, 3])}.

Things are particularly subtle with dictionaries whose keys must always be hashable and whose values need only be hashable if the dictionary itself must be hashable. For example:

t = hl.utils.range_table(1)
# we're trying to create {["hello"]: ["goodbye"]}, but that
# would fail because a python dictionary can't have a list
# as a key
t = t.annotate(x = hl.dict([(["hello"], ["goodbye"])]))
in_list = t.aggregate(hl.agg.collect(t.x))
in_set = t.aggregate(hl.agg.collect_as_set(t.x))

in_list will be [{frozenlist(["hello"]): ["goodbye"]}] (because we should be backwards compatible with the expectation that lists in value-position are mutable) but in_set will be {{frozenlist(["hello"]): frozenlist(["goodbye"])}} (because the dictionary is inside a set and therefore it, itself, must be hashable).


Aside: I also do an ugly subclass to put a type parameter on the FrozenList class from frozenlist, a Python library.


cc: @tpoterba , dang this was trickier than I hoped.

@danking
Copy link
Contributor Author

danking commented Oct 6, 2022

bump for review! I fixed the lint issues.

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

womp, looked earlier but didn't review. Looks fine, don't see a way around the dict hack with less complexity than the hack.

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

def _convert_from_json(self, x):
return frozendict({self.key_type._convert_from_json_na(elt['key']): self.value_type._convert_from_json_na(elt['value']) for elt in x})
def _convert_from_json(self, x, _should_freeze: bool = False):
return frozendict({self.key_type._convert_from_json_na(elt['key'], _should_freeze=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, can this return a normal dict if we don't need to freeze?

def _convert_from_json(self, x):
return frozenset({self.element_type._convert_from_json_na(elt) for elt in x})
def _convert_from_json(self, x, _should_freeze: bool = False):
return frozenset({self.element_type._convert_from_json_na(elt, _should_freeze=True) for elt in x})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@danking
Copy link
Contributor Author

danking commented Oct 7, 2022

I added some types and also fixed a bug where we would fail to return the decoded version of an ndarray in one branch of the if.

CHANGELOG: Fix long-standing bug wherein `hl.agg.collect_as_set` and `hl.agg.counter` error when applied to types which, in Python, are unhashable. For example, `hl.agg.counter(t.list_of_genes)` will not error when `t.list_of_genes` is a list. Instead, the counter dictionary will use `FrozenList` keys from the `frozenlist` package.
@iris-garden iris-garden self-assigned this Oct 27, 2022
@danking danking merged commit 1c8188c into hail-is:main Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants