[hail] Implement concordance in Python. #6224

Merged
merged 4 commits into from
May 31, 2019

tpoterba
Contributor

This is unfortunately about 2x slower, partly because the column
and global concordance calculations are not fused, and partly
because AggArrayPerElement seems pretty slow right now and is
dragging down the per-sample concordance.
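
For readers who want a concrete picture of what "fused" means here, the following is a minimal plain-Python sketch, not Hail code and not the implementation in this PR. It assumes the five-state encoding from Hail's concordance documentation (0 = no data, 1 = no call, 2 = hom ref, 3 = het, 4 = hom var); the helper names `call_state` and `concordance_counts` are hypothetical. It accumulates the global 5x5 count matrix and the per-sample matrices in a single pass over the entries, which is the kind of fusion the description says is currently missing.

```python
from typing import List, Optional, Tuple

# Assumed state encoding for this sketch:
# 0 = no data, 1 = no call, 2 = hom ref, 3 = het, 4 = hom var.
N_STATES = 5

Matrix = List[List[int]]


def call_state(n_alt: Optional[int], present: bool = True) -> int:
    """Map a genotype (alternate-allele count, or None) to a state index."""
    if not present:
        return 0  # variant absent from this dataset
    if n_alt is None:
        return 1  # variant present, genotype not called
    return 2 + n_alt  # 0, 1, 2 alternate alleles -> states 2, 3, 4


def empty_matrix() -> Matrix:
    return [[0] * N_STATES for _ in range(N_STATES)]


def concordance_counts(
    left: List[List[int]], right: List[List[int]]
) -> Tuple[Matrix, List[Matrix]]:
    """Count (left state, right state) pairs globally and per sample.

    `left` and `right` are rows (variants) of per-sample state indices
    with identical shape. Both results are filled in one pass.
    """
    n_samples = len(left[0]) if left else 0
    global_counts = empty_matrix()
    per_sample = [empty_matrix() for _ in range(n_samples)]
    for left_row, right_row in zip(left, right):
        for j, (l, r) in enumerate(zip(left_row, right_row)):
            global_counts[l][r] += 1   # global concordance
            per_sample[j][l][r] += 1   # per-sample (column) concordance
    return global_counts, per_sample
```

An unfused version would walk the entries once for the global matrix and again for the per-sample matrices, which is one plausible source of the slowdown described above.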

@@ -285,9 +285,51 @@ def has_field_of_type(name, dtype):
    return mt.annotate_rows(**{name: result})


def concordance2(left, right, *, _localize_global_statistics=True):
    print('conc2')

Collaborator

The print is clearly debugging code. Is concordance2 itself also?

Contributor Author

whoops! Yes, the whole function can be removed.

Collaborator

@patrick-schultz patrick-schultz left a comment

Just one thought, which you're free to ignore. Otherwise looks good.


lit = hl.literal(included, dtype=hl.tset(hl.tstr))
left = left.filter_cols(lit.contains(left.col_key[0]))
right = right.filter_cols(lit.contains(right.col_key[0]))

Collaborator

This might be faster using semi_join_cols. I assume the semi-join is an ordered merge, so linear, while this is n log n, doing a binary search for each column.

Contributor Author

A semi-join can't do an ordered merge here: we don't store the columns in sorted order.

Collaborator

Oh of course. Never mind!
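
To make the exchange above concrete, here is a small plain-Python sketch, not Hail code, of the two strategies being compared. The function names are hypothetical. The first looks up each column key in a sorted key collection by binary search, which works for any column order. The second is a linear two-pointer merge, which is only correct when both sides are sorted.

```python
from bisect import bisect_left
from typing import List, Sequence


def filter_by_binary_search(
    col_keys: Sequence[str], included_sorted: Sequence[str]
) -> List[str]:
    """Keep keys found in `included_sorted`; column order is irrelevant.

    One O(log m) lookup per column, so O(n log m) overall.
    """
    kept = []
    for key in col_keys:
        i = bisect_left(included_sorted, key)
        if i < len(included_sorted) and included_sorted[i] == key:
            kept.append(key)
    return kept


def filter_by_ordered_merge(
    col_keys_sorted: Sequence[str], included_sorted: Sequence[str]
) -> List[str]:
    """Linear two-pointer merge; correct only if BOTH inputs are sorted."""
    kept = []
    i = 0
    for key in col_keys_sorted:
        # Advance past included keys that sort before the current column key.
        while i < len(included_sorted) and included_sorted[i] < key:
            i += 1
        if i < len(included_sorted) and included_sorted[i] == key:
            kept.append(key)
    return kept


included = ["s1", "s3", "s4"]        # sorted keys to keep
unordered_cols = ["s4", "s2", "s1"]  # columns in arbitrary order

print(filter_by_binary_search(unordered_cols, included))          # ['s4', 's1']
print(filter_by_ordered_merge(unordered_cols, included))          # ['s4'], misses 's1'
print(filter_by_ordered_merge(sorted(unordered_cols), included))  # ['s1', 's4']
```

With unordered columns the merge silently drops `s1`, because its pointer has already moved past it. That is why the per-column lookup is the safe choice when column order is not guaranteed.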

@danking danking merged commit 9d103e1 into hail-is:master May 31, 2019