[hail][performance] Deduplicate inlined IRs in annotate(**thing)#6506
[hail][performance] Deduplicate inlined IRs in annotate(**thing)#6506danking merged 6 commits intohail-is:masterfrom
annotate(**thing)#6506Conversation
Benchmark: ```python @benchmark def per_row_stats_star_star(): mt = hl.read_matrix_table(resource('gnomad_dp_simulation.mt')) mt.annotate_rows(**hl.agg.stats(mt.x))._force_count_rows() ``` This branch: ``` running per_row_stats_star_star... run 1 took 14.53s run 2 took 16.56s run 3 took 15.05s Mean, Median: 15.38s, 15.05s ``` Master: ``` running per_row_stats_star_star... run 1 took 31.47s run 2 took 37.34s run 3 took 26.67s Mean, Median: 31.83s, 31.47s ```
hail/python/hail/ir/ir.py
Outdated
| other.seq_op_args == self.seq_op_args | ||
|
|
||
| def __hash__(self): | ||
| h = hash(self.agg_op) |
There was a problem hiding this comment.
This prompted me to take an interesting detour; wondering why 31 and 37; if I understood, these mimic the Java hash function (apparently 31 is common), the equivalent of << 5 -1 and I suppose <<5 + 5 (37), and are used to more uniformly use the allocated space, to reduce risk of collision.
https://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier
Is that right, or is that explanation incomplete Tim?
There was a problem hiding this comment.
Usually when I write a hash function I just make a tuple of the things I care about and call python's hash on that. Would the results of doing that be worse than building your own hash function like this?
There was a problem hiding this comment.
that's fine, and probably easier! I may change it.
There was a problem hiding this comment.
So each individual hash already does something like this, but using 1000003 (http://effbot.org/zone/python-hash.htm)
If we're combining multiple hashes, do we get the same degree of entropy if we don't perform the shift in the addition, if our operands are hashes themselves?
There was a problem hiding this comment.
@akotlar - I don't know too much about the particulars of hashing (Patrick is the person for that!) but using primes in this way seems to be pretty standard.
XOR seems to be better than addition, too, which is what I was using.
| return True | ||
|
|
||
| def __hash__(self): | ||
| return 31 + hash(str(self)) |
There was a problem hiding this comment.
This seems like it should be a *?
There was a problem hiding this comment.
shouldn't really matter -- I just wanted the hash of the IR to be different from the hash of the str.
Benchmark:
This branch:
Master: