[hail][performance] Deduplicate inlined IRs in `annotate(**thing)` #6506

Merged: 6 commits into hail-is:master on Jun 28, 2019

Conversation

tpoterba (Collaborator) commented Jun 27, 2019

Benchmark:
```python
@benchmark
def per_row_stats_star_star():
    mt = hl.read_matrix_table(resource('gnomad_dp_simulation.mt'))
    mt.annotate_rows(**hl.agg.stats(mt.x))._force_count_rows()
```

This branch:
```
running per_row_stats_star_star...
    run 1 took 14.53s
    run 2 took 16.56s
    run 3 took 15.05s
    Mean, Median: 15.38s, 15.05s
```

Master:
```
running per_row_stats_star_star...
    run 1 took 31.47s
    run 2 took 37.34s
    run 3 took 26.67s
    Mean, Median: 31.83s, 31.47s
```
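
For a rough sense of the size of the improvement, the run times above can be summarized with a few lines of Python. This is only arithmetic on the numbers reported here; the variable names are illustrative and not part of the benchmark suite.

```python
from statistics import mean, median

# Run times in seconds, copied from the benchmark output above.
branch = [14.53, 16.56, 15.05]
master = [31.47, 37.34, 26.67]

# Ratio of master to branch: values above 1 mean the branch is faster.
mean_speedup = mean(master) / mean(branch)
median_speedup = median(master) / median(branch)

print(f"mean speedup:   {mean_speedup:.2f}x")
print(f"median speedup: {median_speedup:.2f}x")
```

With only three runs per side and visible run-to-run variance, this is best read as "roughly twice as fast" rather than a precise figure.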

tpoterba (Collaborator, Author) commented Jun 27, 2019

cc @cseed @lfrancioli @konradjk

@@ -1176,6 +1177,20 @@ def __eq__(self, other):
               other.init_op_args == self.init_op_args and \
               other.seq_op_args == self.seq_op_args

    def __hash__(self):
        h = hash(self.agg_op)

akotlar (Collaborator), Jun 28, 2019

This prompted me to take an interesting detour, wondering why 31 and 37. If I understood correctly, these mimic the Java hash function (apparently 31 is common). Multiplying by 31 is the equivalent of `(x << 5) - x`, and I suppose multiplying by 37 is `(x << 5) + (x << 2) + x`. They are used to spread values more uniformly over the allocated space, which reduces the risk of collision.

https://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier

Is that right, or is that explanation incomplete, Tim?
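
The shift identities behind these two multipliers can be checked directly. The sketch below is plain integer arithmetic, not code from this PR, and the function names are made up for illustration: multiplying by 31 is a shift by 5 minus the value itself, and multiplying by 37 is a shift by 5 plus a shift by 2 plus the value.

```python
def times_31(x: int) -> int:
    # 31 * x == 32 * x - x
    return (x << 5) - x


def times_37(x: int) -> int:
    # 37 * x == 32 * x + 4 * x + x
    return (x << 5) + (x << 2) + x


# Verify both identities over a range of positive and negative integers.
for x in range(-1000, 1000):
    assert times_31(x) == 31 * x
    assert times_37(x) == 37 * x
```

In Python the shift form has no practical speed advantage; the identities mainly explain why these constants were historically popular.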

johnc1231 (Contributor), Jun 28, 2019

Usually when I write a hash function, I just make a tuple of the things I care about and call Python's `hash` on that. Would the results of doing that be worse than building your own hash function like this?

tpoterba (Author, Collaborator), Jun 28, 2019

that's fine, and probably easier! I may change it.

akotlar (Collaborator), Jun 28, 2019

So each individual hash already does something like this, but using 1000003 (http://effbot.org/zone/python-hash.htm).

If we're combining multiple hashes, do we get the same degree of entropy if we add them without the shift, given that our operands are hashes themselves?

tpoterba (Author, Collaborator), Jun 28, 2019

@akotlar - I don't know too much about the particulars of hashing (Patrick is the person for that!) but using primes in this way seems to be pretty standard.

XOR also seems to be better than addition, which is what I was using.
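
The two approaches discussed in this thread can be put side by side. The sketch below is an assumption-laden illustration, not the code in this PR: `AggCall` is a hypothetical stand-in class, only the field names `agg_op`, `init_op_args`, and `seq_op_args` come from the diff, and `manual_hash` is one possible prime-and-XOR combination rather than the function actually committed (the diff hunk above is truncated).

```python
class AggCall:
    """Hypothetical stand-in for an IR node; not Hail's actual class."""

    def __init__(self, agg_op, init_op_args, seq_op_args):
        self.agg_op = agg_op
        # Stored as tuples so that the fields are themselves hashable.
        self.init_op_args = tuple(init_op_args)
        self.seq_op_args = tuple(seq_op_args)

    def _key(self):
        return (self.agg_op, self.init_op_args, self.seq_op_args)

    def __eq__(self, other):
        return isinstance(other, AggCall) and self._key() == other._key()

    def __hash__(self):
        # Tuple approach: delegate the mixing to Python's built-in tuple hash.
        return hash(self._key())

    def manual_hash(self):
        # Hand-rolled approach: multiply by a small prime, then XOR in the next field.
        h = hash(self.agg_op)
        h = (31 * h) ^ hash(self.init_op_args)
        h = (37 * h) ^ hash(self.seq_op_args)
        return h


a = AggCall("Sum", (), ("x",))
b = AggCall("Sum", (), ("x",))
c = AggCall("Max", (), ("x",))

# Equal nodes collapse to one entry in a set; the distinct node stays separate.
distinct = {a, b, c}
```

Both versions satisfy the one hard requirement, that equal objects hash equally. The tuple version keeps `__eq__` and `__hash__` tied to the same key, which makes it harder for the two to drift apart.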

@@ -68,6 +68,9 @@ def _eq(self, other):
        """
        return True

    def __hash__(self):
        return 31 + hash(str(self))

johnc1231 (Contributor), Jun 28, 2019

Why add 31 here?

akotlar (Collaborator), Jun 28, 2019

This seems like it should be a `*`?

tpoterba (Author, Collaborator), Jun 28, 2019

shouldn't really matter -- I just wanted the hash of the IR to be different from the hash of the str.
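
To make the role of this `__hash__` concrete, here is a minimal sketch of how a string-based hash lets equal nodes be deduplicated with a dictionary. `Node` and `dedup` are hypothetical names invented for this example, and the string renderings are made up; this is not Hail's IR class or the deduplication code in this PR.

```python
class Node:
    """Hypothetical IR node used only for illustration; not Hail's IR class."""

    def __init__(self, rendered: str):
        self.rendered = rendered

    def __str__(self):
        return self.rendered

    def __eq__(self, other):
        return isinstance(other, Node) and str(self) == str(other)

    def __hash__(self):
        # Offset so the node does not hash identically to its own string form.
        return 31 + hash(str(self))


def dedup(nodes):
    """Keep the first occurrence of each distinct node, preserving order."""
    seen = {}
    for n in nodes:
        seen.setdefault(n, n)
    return list(seen.values())


stats = Node("(AggStats x)")
fields = [Node("(AggStats x)"), stats, Node("(AggMax x)")]
unique = dedup(fields)
```

Here `unique` holds two nodes, because the first two entries compare equal and hash equally. Whether the constant is added or multiplied makes no difference to this behavior; correctness depends only on equal nodes producing equal hashes.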

tpoterba added some commits Jun 28, 2019

fix
fix

@danking danking merged commit 3bd837f into hail-is:master Jun 28, 2019

1 check passed

ci-test success