Support compression presets for LZMA compression backend #726
Comments
Upstream support in |
A related issue which might push us to not using the |
PR with a small patch for xopen to support And multi-threaded xz compression in xopen is already being worked on pycompression/xopen#101. |
Very nice! Thank you, @tsibley!
Version 1.5 of xopen was released today (2022-03-23), incorporating the two changes mentioned in a previous comment. Even if Augur doesn't opt for multithreaded compression, its performance may increase from either offloading the (de)compression onto a separate process or using a lighter compression level by default, or both.
Context
Augur's current backend for transparently reading/writing compressed data is xopen. xopen uses the standard Python LZMA module to handle xz files, but it only uses the filename and mode arguments. Python’s LZMA library also includes support for a preset argument that controls the amount of CPU and memory used for compression. Although Augur’s io module passes through extra keyword arguments to the internal compression tool, this doesn’t help us because xopen does not support LZMA’s presets. xopen only passes through its compresslevel argument for gzip files.
Description
To get better performance in our workflows, we'd like to support different compression presets for LZMA compression in our io.py module.
Examples
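As a rough illustration of what the preset argument controls, the sketch below uses only Python's standard lzma module (not xopen or Augur) to write the same text at several presets and report the resulting file sizes. The helper names write_xz and read_xz and the sample sequence data are hypothetical and exist only for this example.

```python
import lzma
import os
import tempfile


def write_xz(path, text, preset):
    """Write text to an xz file with the given LZMA preset (0-9); return file size in bytes."""
    with lzma.open(path, mode="wt", preset=preset, encoding="utf-8") as handle:
        handle.write(text)
    return os.path.getsize(path)


def read_xz(path):
    """Read an xz file back as text. No preset is needed (or allowed) for reading."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical sample data standing in for a sequence file.
    text = ">sequence\n" + "ACGTTGCA" * 50_000 + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        for preset in (0, 2, 6):
            path = os.path.join(tmp, f"preset{preset}.fasta.xz")
            size = write_xz(path, text, preset)
            # Every preset must round-trip to the same content.
            assert read_xz(path) == text
            print(f"preset={preset}: {size} bytes")
```

Lower presets generally trade a larger output file for less CPU time and memory, which is the trade-off this issue is about. Actual sizes and timings depend on the data, so none are quoted here.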
Possible solutions
Switch from xz to gzip. This is a quick fix in that it requires no changes to Augur. It will introduce some breaking changes to workflows that depend on our hosting xz files. This isn't really an acceptable solution.
Modify Augur to use an I/O library that parameterizes compression levels for different compression backends like LZMA. This is a long-term fix and will be backward compatible. Some libraries to consider are:
lzma module and there doesn’t appear to be an interface to pass kwargs through fsspec.open to the compression backend.
Submit a patch to xopen that supports passing the compresslevel argument from xopen.xopen through to the LZMA backend as the preset. This is a long-term fix and not guaranteed to work if the patch isn’t accepted.
Implement our own compression backend that handles the formats we want to support (.gz and .xz for now) with the interface we want. This implementation could borrow code or ideas from xopen without using the whole third-party library. This solution gives us ultimate freedom but also gives us additional responsibility for maintaining a compression backend. We would also want to consider how our custom implementation would interact with other I/O libraries that we may want to use in the future like fsspec or smart_open to add support for other transports to all Augur I/O.

However we implement support for configuration of compression presets, we also need to provide an interface for users to configure these presets without modifying code. Augur lacks a proper configuration file, but it does rely on environment variables to modify program behavior. Examples of this include JSON indentation and the recursion limit for tree traversals. We could implement a similar AUGUR_COMPRESSION_LEVEL or AUGUR_COMPRESSION_PRESET variable that the open_file function would use when it is defined.
We should then set the default compression preset (perhaps per format) to the value we want (e.g., compression=2 for LZMA), so we do not have to change anything in our workflows other than upgrading Augur.
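One way the custom-backend option and the environment variable could fit together is sketched below, using only the standard gzip and lzma modules. This is not Augur's implementation: the function names open_compressed and resolve_level, the per-format defaults, and the choice to read AUGUR_COMPRESSION_LEVEL are all assumptions made for illustration.

```python
import gzip
import lzma
import os

# Assumption: variable name taken from the proposal above.
ENV_VAR = "AUGUR_COMPRESSION_LEVEL"
# Assumption: hypothetical per-format defaults used when the variable is unset.
DEFAULT_LEVELS = {".gz": 6, ".xz": 2}


def resolve_level(suffix, environ=None):
    """Return the compression level for a suffix, preferring the environment variable."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)
    if value is None:
        return DEFAULT_LEVELS[suffix]
    level = int(value)
    # Both gzip levels and LZMA presets accept integers from 0 to 9.
    if not 0 <= level <= 9:
        raise ValueError(f"{ENV_VAR} must be between 0 and 9, got {level}")
    return level


def open_compressed(path, mode="rt", environ=None):
    """Open a .gz, .xz, or plain file; pass an explicit 't' or 'b' in mode."""
    suffix = os.path.splitext(path)[1]
    writing = any(flag in mode for flag in "wax")
    # Only text modes accept an encoding argument.
    text_kwargs = {"encoding": "utf-8"} if "t" in mode else {}
    if suffix == ".xz":
        # lzma rejects a preset when reading, so only set it for writes.
        preset = resolve_level(suffix, environ) if writing else None
        return lzma.open(path, mode, preset=preset, **text_kwargs)
    if suffix == ".gz":
        # compresslevel is ignored by gzip when reading.
        level = resolve_level(suffix, environ) if writing else 9
        return gzip.open(path, mode, compresslevel=level, **text_kwargs)
    return open(path, mode, **text_kwargs)
```

With this shape, a workflow could run with AUGUR_COMPRESSION_LEVEL=1 set in its environment to get faster xz output without any code changes, while unset environments fall back to the per-format defaults.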