Support compression presets for LZMA compression backend #726

Open
huddlej opened this issue May 24, 2021 · 5 comments
Assignees
Labels
enhancement New feature or request

Comments

@huddlej
Contributor

huddlej commented May 24, 2021

Context
Augur's current backend for transparently reading/writing compressed data is xopen. xopen uses the standard Python lzma module to handle xz files, but it only passes the filename and mode arguments. Python’s lzma module also supports a preset argument that controls the amount of CPU and memory used for compression. Although Augur’s io module passes extra keyword arguments through to the internal compression tool, this doesn’t help us because xopen does not support LZMA’s presets: xopen only passes its compresslevel argument through for gzip files.
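To make the gap concrete, here is a minimal sketch of what the standard library offers on its own. `lzma.open` accepts a `preset` keyword (an integer from 0 to 9, with 6 as the module default) when writing. The helper name `write_xz` and the sample records are illustrative assumptions, not part of Augur or xopen.

```python
import lzma
import os
import tempfile


def write_xz(path, text, preset=None):
    """Write text to an xz file; preset=None falls back to lzma's default preset."""
    with lzma.open(path, "wt", preset=preset) as handle:
        handle.write(text)


# Hypothetical sample data, only for illustration.
records = ">seq1\nACGTACGT\n" * 1000

with tempfile.TemporaryDirectory() as tmpdir:
    fast_path = os.path.join(tmpdir, "fast.fasta.xz")
    write_xz(fast_path, records, preset=2)

    # A lower preset changes the CPU/memory cost of writing, not the file
    # format, so the file is read back the same way regardless of preset.
    with lzma.open(fast_path, "rt") as handle:
        round_trip = handle.read()
```

Note that `preset` is only valid when writing; `lzma` raises a `ValueError` if a preset is supplied while opening a file for reading.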

Description

To get better performance in our workflows, we'd like to support different compression presets for LZMA compression in our io.py module.

Examples

import os

from augur.io import open_file, write_sequences

# Write a file with LZMA compression using a reduced CPU/memory preset.
# This example requires no change to the Augur API but does require a change in the xopen
# backend or the replacement of this backend with another library (including one of our own).
with open_file("sequences.fasta.xz", "w", compression_level=2) as out_handle:
    write_sequences(sequences, out_handle)

# Write sequences with LZMA compression using a custom preset defined in the environment.
os.environ["AUGUR_COMPRESSION_LEVEL"] = "2"
write_sequences(sequences, "sequences.fasta.xz")

Possible solutions

  1. Switch from xz to gzip. This is a quick fix in that it requires no changes to Augur. It will introduce some breaking changes to workflows that depend on our hosting xz files. This isn't really an acceptable solution.

  2. Modify Augur to use an I/O library that parameterizes compression levels for different compression backends like LZMA. This is a long-term fix and will be backward compatible. Some libraries to consider are:

    1. fsspec: to use this module, it seems we would need to register our own LZMA compression function that hardcodes a compression preset. The default LZMA backend uses the default preset from the Python lzma module and there doesn’t appear to be an interface to pass kwargs through fsspec.open to the compression backend.
    2. smart_open: as with fsspec, this module would require us to define our own LZMA backend (there isn’t one supported by default) that hardcodes the compression preset we want to use.
  3. Submit a patch to xopen that supports passing the compresslevel argument from xopen.xopen through to the LZMA backend as the preset. This is a long-term fix, but it depends on the patch being accepted upstream.

  4. Implement our own compression backend that handles the formats we want to support (.gz and .xz for now) with the interface we want. This implementation could borrow code or ideas from xopen without using the whole third-party library. This solution gives us ultimate freedom but also gives us additional responsibility for maintaining a compression backend. We would also want to consider how our custom implementation would interact with other I/O libraries that we may want to use in the future like fsspec or smart_open to add support for other transports to all Augur I/O.
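As a rough illustration of option 4, the sketch below dispatches on the file extension and maps a single `compression_level` argument onto each standard-library backend. The function name `open_compressed` and its signature are assumptions made for this example; they are not Augur's or xopen's actual API.

```python
import gzip
import lzma
import os
import tempfile


def open_compressed(path, mode="rt", compression_level=None):
    """Open a plain, .gz, or .xz file with one level argument for all backends."""
    path = str(path)
    writing = any(flag in mode for flag in "wax")

    if path.endswith(".xz"):
        # lzma calls this knob `preset` and rejects it when reading.
        preset = compression_level if writing else None
        return lzma.open(path, mode, preset=preset)

    if path.endswith(".gz"):
        # gzip calls it `compresslevel`; only override the default when asked.
        if writing and compression_level is not None:
            return gzip.open(path, mode, compresslevel=compression_level)
        return gzip.open(path, mode)

    # Uncompressed files ignore the level entirely.
    return open(path, mode)


with tempfile.TemporaryDirectory() as tmpdir:
    results = {}
    for name in ("sequences.fasta.xz", "sequences.fasta.gz", "sequences.fasta"):
        target = os.path.join(tmpdir, name)
        with open_compressed(target, "wt", compression_level=2) as out_handle:
            out_handle.write(">seq1\nACGT\n")
        with open_compressed(target, "rt") as in_handle:
            results[name] = in_handle.read()
```

A backend along these lines stays small, but it would not cover features that xopen provides, such as piping through external compression programs.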

However we implement support for configuration of compression presets, we also need to provide an interface for users to configure these presets without modifying code. Augur lacks a proper configuration file, but it does rely on environment variables to modify program behavior. Examples of this include JSON indentation and the recursion limit for tree traversals. We could implement a similar AUGUR_COMPRESSION_LEVEL or AUGUR_COMPRESSION_PRESET variable that the open_file function would use when it is defined.

We should then set the default compression preset (perhaps per format) to the value we want (e.g., preset 2 for LZMA), so we do not have to change anything in our workflows other than upgrading Augur.
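One way the environment variable and per-format defaults could fit together is sketched below. The precedence order (explicit argument, then `AUGUR_COMPRESSION_LEVEL`, then a per-format default), the helper name `resolve_compression_level`, and the `DEFAULT_LEVELS` table are all assumptions for illustration; the issue does not specify them.

```python
import os

# Hypothetical per-format defaults. The value 2 for xz follows the example in
# this issue; formats not listed fall back to the backend's own default (None).
DEFAULT_LEVELS = {"xz": 2}


def resolve_compression_level(fmt, explicit=None, environ=None):
    """Choose a level: explicit argument, then environment, then format default."""
    environ = os.environ if environ is None else environ

    if explicit is not None:
        return explicit

    raw = environ.get("AUGUR_COMPRESSION_LEVEL")
    if raw is not None and raw.strip() != "":
        level = int(raw)  # raises ValueError for non-numeric values
        if not 0 <= level <= 9:
            raise ValueError(f"AUGUR_COMPRESSION_LEVEL must be 0-9, got {level}")
        return level

    return DEFAULT_LEVELS.get(fmt)
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.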

huddlej added the enhancement label May 24, 2021
@tsibley
Member

tsibley commented Mar 21, 2022

Upstream support in xopen (option 3) seems like the best option here, assuming a patch is accepted. Seems to me very worth submitting a patch upstream sooner rather than later, and then we can loop back around to it when we consider the rest of the implementation of this issue.

@tsibley
Member

tsibley commented Mar 21, 2022

A related issue which might push us away from using the stdlib lzma module for .xz files is supporting multi-threaded compression. But again, this could also fit into a patch for xopen, albeit a larger one.

@tsibley
Member

tsibley commented Mar 21, 2022

PR with a small patch for xopen to support compresslevel: pycompression/xopen#102

And multi-threaded xz compression in xopen is already being worked on in pycompression/xopen#101.

@huddlej
Contributor Author

huddlej commented Mar 22, 2022

Very nice! Thank you, @tsibley!

@fanninpm

fanninpm commented Mar 23, 2022

Version 1.5 of xopen was released today (2022-03-23), incorporating the two changes mentioned in a previous comment. Even if Augur doesn't opt for multithreaded compression, its performance may improve from offloading the (de)compression onto a separate process, from using a lighter compression level by default, or both.

Projects
No open projects
Status: Prioritized
Development

No branches or pull requests

4 participants