Support compression presets for LZMA compression backend #726

huddlej · 2021-05-24T17:00:17Z

Context
Augur's current backend for transparently reading/writing compressed data is xopen. xopen uses the standard Python lzma module to handle xz files, but it only passes the filename and mode arguments. Python's lzma module also supports a preset argument that controls the amount of CPU and memory used for compression. Although Augur's io module passes extra keyword arguments through to the internal compression tool, this doesn't help us because xopen does not support LZMA's presets. xopen only passes its compresslevel argument through for gzip files.
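For reference, here is a minimal sketch of the preset argument when calling the standard library's lzma module directly, without xopen. The sample payload and the choice of preset 2 are assumptions for illustration, not measured recommendations.

```python
import io
import lzma

# Made-up FASTA-like payload, used only for this sketch.
data = b">seq1\nACGTACGTACGT\n" * 10_000


def compress_with_preset(payload: bytes, preset: int) -> bytes:
    """Compress payload into the .xz container using the given LZMA preset (0-9)."""
    buffer = io.BytesIO()
    # lzma.open accepts a file object as well as a path; preset is only valid when writing.
    with lzma.open(buffer, "wb", preset=preset) as handle:
        handle.write(payload)
    return buffer.getvalue()


low = compress_with_preset(data, preset=2)
default = compress_with_preset(data, preset=lzma.PRESET_DEFAULT)

# The preset changes the CPU/memory trade-off, not the decompressed content.
assert lzma.decompress(low) == data
assert lzma.decompress(default) == data
```

Lower presets generally trade compression ratio for less CPU and memory; the actual gains depend on the data and would need to be measured on our workflows.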

Description

To get better performance in our workflows, we'd like to support different compression presets for LZMA compression in our io.py module.

Examples

# Write a file with LZMA compression using a reduced CPU/memory preset.
# This example requires no change to the Augur API but does require a change in the xopen
# backend or the replacement of this backend with another library (including one of our own).
with open_file("sequences.fasta.xz", "w", compression_level=2) as out_handle:
    write_sequences(sequences, out_handle)

# Write sequences with LZMA compression using a custom preset defined in the environment.
import os
os.environ["AUGUR_COMPRESSION_LEVEL"] = "2"
write_sequences(sequences, "sequences.fasta.xz")

Possible solutions

1. Switch from xz to gzip. This is a quick fix in that it requires no changes to Augur. It will introduce some breaking changes to workflows that depend on our hosting xz files. This isn't really an acceptable solution.
2. Modify Augur to use an I/O library that parameterizes compression levels for different compression backends like LZMA. This is a long-term fix and will be backward compatible. Some libraries to consider are:
   1. fsspec: to use this module, it seems we would need to register our own LZMA compression function that hardcodes a compression preset. The default LZMA backend uses the default preset from the Python lzma module and there doesn't appear to be an interface to pass kwargs through fsspec.open to the compression backend.
   2. smart_open: as with fsspec, this module would require us to define our own LZMA backend (there isn't one supported by default) that hardcodes the compression preset we want to use.
3. Submit a patch to xopen that supports passing the compresslevel argument from xopen.xopen through to the LZMA backend as the preset. This is a long-term fix and not guaranteed to work if the patch isn't accepted.
4. Implement our own compression backend that handles the formats we want to support (.gz and .xz for now) with the interface we want. This implementation could borrow code or ideas from xopen without using the whole third-party library. This solution gives us ultimate freedom but also gives us additional responsibility for maintaining a compression backend. We would also want to consider how our custom implementation would interact with other I/O libraries that we may want to use in the future, like fsspec or smart_open, to add support for other transports to all Augur I/O.
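To make the last option above (our own compression backend) more concrete, here is a rough sketch of an opener that dispatches on file extension using only the standard library. This open_file is a hypothetical stand-in, not Augur's actual implementation, and the compression_level keyword simply follows the example earlier in this issue.

```python
import gzip
import lzma
import tempfile
from pathlib import Path


def open_file(path, mode="r", compression_level=None, **kwargs):
    """Hypothetical opener for plain, .gz, and .xz files (sketch only)."""
    # Treat modes without "b" as text, since gzip.open and lzma.open default to binary.
    if "b" not in mode and "t" not in mode:
        mode += "t"
    writing = any(flag in mode for flag in "wax")
    suffix = Path(path).suffix

    if suffix == ".gz":
        if writing and compression_level is not None:
            kwargs["compresslevel"] = compression_level
        return gzip.open(path, mode, **kwargs)
    if suffix == ".xz":
        # lzma.open raises ValueError if a preset is given when reading.
        if writing and compression_level is not None:
            kwargs["preset"] = compression_level
        return lzma.open(path, mode, **kwargs)
    return open(path, mode, **kwargs)


# Round trip through a temporary .xz file with a reduced preset.
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "sequences.fasta.xz")
    with open_file(target, "w", compression_level=2) as out_handle:
        out_handle.write(">seq1\nACGT\n")
    with open_file(target, "r") as in_handle:
        round_trip = in_handle.read()
```

A sketch like this leaves out everything else xopen provides (for example, piping through external processes), which is part of the maintenance cost mentioned above.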

However we implement support for configuration of compression presets, we also need to provide an interface for users to configure these presets without modifying code. Augur lacks a proper configuration file, but it does rely on environmental variables to modify program behavior. Examples of this include JSON indentation and the recursion limit for tree traversals. We could implement a similar AUGUR_COMPRESSION_LEVEL or AUGUR_COMPRESSION_PRESET variable that the open_file module would use when it is defined.

We should then set the default compression preset (perhaps per format) to the value we want (e.g., compression_level=2 for LZMA), so we do not have to change anything in our workflows other than upgrading Augur.
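One way the precedence could work is sketched below: an explicit argument wins, then the AUGUR_COMPRESSION_LEVEL environment variable, then a per-format default. The function name resolve_compression_level and the DEFAULT_LEVELS table are hypothetical; only the LZMA default of 2 comes from the suggestion above.

```python
import os

# Hypothetical per-format defaults; formats not listed keep their library default (None).
DEFAULT_LEVELS = {".xz": 2}


def resolve_compression_level(suffix, explicit=None, environ=os.environ):
    """Return the level to use: explicit argument, then environment, then format default."""
    if explicit is not None:
        return explicit
    raw = environ.get("AUGUR_COMPRESSION_LEVEL")
    if raw is not None:
        level = int(raw)  # raises ValueError for non-numeric values
        if not 0 <= level <= 9:
            raise ValueError(f"AUGUR_COMPRESSION_LEVEL must be between 0 and 9, got {level}")
        return level
    return DEFAULT_LEVELS.get(suffix)


# The environ parameter is passed explicitly here so the examples do not depend on the real environment.
from_default = resolve_compression_level(".xz", environ={})
from_env = resolve_compression_level(".xz", environ={"AUGUR_COMPRESSION_LEVEL": "4"})
from_arg = resolve_compression_level(".xz", explicit=1, environ={"AUGUR_COMPRESSION_LEVEL": "4"})
```

Keeping the lookup in one small function would let workflows change the level by setting the variable, while code that passes an explicit value keeps full control.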


tsibley · 2022-03-21T18:41:34Z

Upstream support in xopen (option 3) seems like the best option here, assuming a patch is accepted. It seems to me well worth submitting a patch upstream sooner rather than later; we can then loop back around to it when we consider the rest of the implementation of this issue.

tsibley · 2022-03-21T18:49:16Z

A related issue that might push us away from using the stdlib lzma module for .xz files is support for multi-threaded compression. But again, this could also fit into a patch for xopen, albeit a larger one.

tsibley · 2022-03-21T19:00:47Z

PR with a small patch for xopen to support compresslevel → pycompression/xopen#102

And multi-threaded xz compression in xopen is already being worked on in pycompression/xopen#101.

huddlej · 2022-03-22T19:51:33Z

Very nice! Thank you, @tsibley!

fanninpm · 2022-03-23T16:48:20Z

Version 1.5 of xopen was released today (2022-03-23), incorporating the two changes mentioned in a previous comment. Even if Augur doesn't opt for multithreaded compression, its performance may increase from either offloading the (de)compression onto a separate process or using a lighter compression level by default, or both.

huddlej added the enhancement New feature or request label May 24, 2021

victorlin assigned huddlej and victorlin Mar 30, 2022

fanninpm mentioned this issue Jun 24, 2022

[PERF, v2] Specify Compression Level nextstrain/nextclade#888

Closed
