Skip to content

[hail][performance] Deduplicate inlined IRs in annotate(**thing)#6506

Merged
danking merged 6 commits intohail-is:masterfrom
tpoterba:python-star-star
Jun 28, 2019
Merged

[hail][performance] Deduplicate inlined IRs in annotate(**thing)#6506
danking merged 6 commits intohail-is:masterfrom
tpoterba:python-star-star

Conversation

@tpoterba
Copy link
Contributor

Benchmark:

@benchmark
def per_row_stats_star_star():
    mt = hl.read_matrix_table(resource('gnomad_dp_simulation.mt'))
    mt.annotate_rows(**hl.agg.stats(mt.x))._force_count_rows()

This branch:

running per_row_stats_star_star...
    run 1 took 14.53s
    run 2 took 16.56s
    run 3 took 15.05s
    Mean, Median: 15.38s, 15.05s

Master:

running per_row_stats_star_star...
    run 1 took 31.47s
    run 2 took 37.34s
    run 3 took 26.67s
    Mean, Median: 31.83s, 31.47s

Benchmark:
```python
@benchmark
def per_row_stats_star_star():
    mt = hl.read_matrix_table(resource('gnomad_dp_simulation.mt'))
    mt.annotate_rows(**hl.agg.stats(mt.x))._force_count_rows()
```

This branch:
```
running per_row_stats_star_star...
    run 1 took 14.53s
    run 2 took 16.56s
    run 3 took 15.05s
    Mean, Median: 15.38s, 15.05s
```

Master:
```
running per_row_stats_star_star...
    run 1 took 31.47s
    run 2 took 37.34s
    run 3 took 26.67s
    Mean, Median: 31.83s, 31.47s
```
@tpoterba
Copy link
Contributor Author

cc @cseed @lfrancioli @konradjk

other.seq_op_args == self.seq_op_args

def __hash__(self):
h = hash(self.agg_op)
Copy link
Contributor

@akotlar akotlar Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prompted me to take an interesting detour; wondering why 31 and 37; if I understood, these mimic the Java hash function (apparently 31 is common), the equivalent of << 5 -1 and I suppose <<5 + 5 (37), and are used to more uniformly use the allocated space, to reduce risk of collision.

https://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier

Is that right, or is that explanation incomplete Tim?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually when I write a hash function I just make a tuple of the things I care about and call python's hash on that. Would the results of doing that be worse than building your own hash function like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine, and probably easier! I may change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So each individual hash already does something like this, but using 1000003 (http://effbot.org/zone/python-hash.htm)

If we're combining multiple hashes, do we get the same degree of entropy if we don't perform the shift in the addition, if our operands are hashes themselves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akotlar - I don't know too much about the particulars of hashing (Patrick is the person for that!) but using primes in this way seems to be pretty standard.

XOR seems to be better than addition, too, which is what I was using.

return True

def __hash__(self):
return 31 + hash(str(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add 31 here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it should be a *?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't really matter -- I just wanted the hash of the IR to be different from the hash of the str.

@danking danking merged commit 3bd837f into hail-is:master Jun 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants