[query] import_table: simplify code and improve speed #11782

Merged
merged 2 commits into from
Apr 22, 2022


danking
Contributor

@danking danking commented Apr 20, 2022

CHANGELOG: hl.import_table is up to twice as fast for small tables.

The big change is optimizing for the single-file, no-filters case, in which
we need not scan for the first extant row: that row must be in the first
partition, if it exists at all. Unfortunately, there is no zero-RPC way to
determine the number of partitions in a table, so I must catch an error
about the lack of a zeroth partition.
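
The fast path described above can be illustrated with a plain-Python analogy. This is only a sketch, not the code in this PR: `first_row_scan`, `first_row_fast`, and the list-of-lists model of partitions are hypothetical, and `IndexError` stands in for whatever error Hail raises when the zeroth partition does not exist.

```python
from typing import List, Optional

# Hypothetical model: a table is a list of partitions, each a list of rows.
Partition = List[str]


def first_row_scan(partitions: List[Partition]) -> Optional[str]:
    """General case: walk the partitions in order until a row is found."""
    for partition in partitions:
        for row in partition:
            return row
    return None


def first_row_fast(partitions: List[Partition]) -> Optional[str]:
    """Single-file, no-filters case: only partition zero is inspected."""
    try:
        first = partitions[0]
    except IndexError:
        # No zeroth partition at all, so the table is empty.
        return None
    # If partition zero has no rows, no later partition has any either
    # (this holds only under the single-file, no-filters assumption).
    return first[0] if first else None
```

The fast path trades a possible exception for not having to ask how many partitions exist before looking at the first one.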

I also did some refactoring:

  1. Move some functions to a utility file and add lots of indents and newlines to make them readable.
  2. Use hl.format for constructing strings.
  3. Make should_filter_line into should_remove_line for clarity of name.
  4. Modify should_remove_line to use short-circuiting and/or instead of array folds.
  5. Modify should_remove_line to indicate (via returning None) when there are no filters enabled.
  6. Add types.
  7. Fix a bug where we assumed that .collect()[0] would be None if there were no values in the table. (It raises an error instead.)
  8. Deduplicate hail.utils.deduplicate (haha: I mean, there is already code for doing field dedupe)
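
Items 4, 5, and 7 can be illustrated in plain Python. This is a sketch under assumptions, not the code in this PR: the real `should_remove_line` builds Hail expressions, whereas the version below works on ordinary strings. Its parameter names (`comment_prefixes`, `skip_blank_lines`, `filter_regex`) and the helper `first_or_none` are hypothetical.

```python
import re
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def should_remove_line(
    comment_prefixes: Sequence[str] = (),
    skip_blank_lines: bool = False,
    filter_regex: Optional[str] = None,
) -> Optional[Callable[[str], bool]]:
    """Build a line predicate, or return None when no filters are enabled."""
    if not comment_prefixes and not skip_blank_lines and filter_regex is None:
        # Signals to the caller that filtering can be skipped entirely.
        return None

    prefixes = tuple(comment_prefixes)
    pattern = re.compile(filter_regex) if filter_regex is not None else None

    def predicate(line: str) -> bool:
        # `or` short-circuits: later checks run only if earlier ones are False.
        return (
            (skip_blank_lines and line == "")
            or line.startswith(prefixes)
            or (pattern is not None and pattern.search(line) is not None)
        )

    return predicate


def first_or_none(values: List[T]) -> Optional[T]:
    """Indexing an empty list raises IndexError; it does not yield None."""
    return values[0] if values else None
```

Returning `None` rather than an always-false predicate lets the caller tell "no filters" apart from "filters that happen to match nothing", which is what makes a no-filters fast path possible.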

Daniel King added 2 commits April 20, 2022 12:45
CHANGELOG: `hl.import_table` is up to twice as fast for small tables.

I mostly abstracted things and simplified code. One big change is optimizing
for the single file, no filters case in which we need not scan for the first
extant row, that row *must* be in the first partition, if it exists at all.
@tpoterba
Contributor

can give this one to Chris, who's worked on this most recently

Collaborator

@chrisvittal chrisvittal left a comment

LGTM

@danking danking merged commit 507cdbb into hail-is:main Apr 22, 2022
CDiaz96 pushed a commit to CDiaz96/hail that referenced this pull request Apr 27, 2022
* [query] import_table: simplify code and improve speed

* handle empty files