Add decompress: true option to Path.size() method #6086

@prototaxites

New feature

Path objects currently have a .size() method that reports the file size in bytes. It would be nice to add a decompress: true option to this method so that, for gzipped files, it returns the size the file would have on disk once uncompressed. If a file is not compressed in the first place, it should probably emit a warning and then return the plain file size?

This functionality is already present for other functions in the standard library, such as .countLines().

Use case

Sometimes it is useful to know the decompressed size, for example in order to set resource requirements. Creating an index for bwa-mem2 from a FASTA file, for instance, requires 28x the uncompressed file size in memory in order to complete. The command can take a gzipped file as input, though, so it would be good to be able to get the uncompressed size without first decompressing the file (which means you are then storing two copies of the file in the work directory!).
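To make the arithmetic concrete, here is a minimal Python sketch of the memory estimate described above. The 28x factor comes from this issue; the function names and the GiB conversion are hypothetical helpers for illustration, not part of Nextflow or bwa-mem2.

```python
# Hypothetical helpers illustrating the 28x rule of thumb quoted in this issue.

def index_memory_bytes(uncompressed_bytes: int, factor: int = 28) -> int:
    """Estimate the memory needed to index a FASTA of the given uncompressed size."""
    return factor * uncompressed_bytes


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 1024**3


if __name__ == "__main__":
    fasta_bytes = 3 * 1024**3  # assumed example: a 3 GiB uncompressed FASTA
    print(f"{to_gib(index_memory_bytes(fasta_bytes)):.0f} GiB")
```

With the assumed 3 GiB input, the estimate comes to 84 GiB, which is the kind of value one would want to pass to a process memory directive without ever decompressing the file.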

Suggested implementation

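I don't think this has to be resource-intensive; streaming the whole file from the head process is obviously not ideal. Gzipped files have this information embedded in the file itself, in the 4-byte trailer at the end of the file, and it can be accessed near instantly on the command line with gzip -l file.gz. That field stores the size modulo 2^32 and only covers the last gzip member, so streaming the whole file would only need to happen as a fallback, for files of 4 GiB or more or for multi-member archives?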
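The suggestion above could be sketched roughly as follows, in Python rather than Groovy and purely as an illustration. It relies on the gzip format's ISIZE field (the last four bytes of the file, little-endian, uncompressed size modulo 2^32). The function names and the `trust_trailer` switch are hypothetical and are not an existing Nextflow API.

```python
import gzip
import os
import struct
import warnings

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def isize_from_trailer(path):
    """Read ISIZE: the last 4 bytes, a little-endian uint32.

    This is the uncompressed size of the *last* member modulo 2**32, so it is
    wrong for inputs of 4 GiB or more and for multi-member files (e.g. bgzip).
    """
    with open(path, "rb") as fh:
        fh.seek(-4, os.SEEK_END)
        return struct.unpack("<I", fh.read(4))[0]


def streamed_size(path, chunk_size=1 << 20):
    """Fallback: decompress the whole stream and count the bytes."""
    total = 0
    with gzip.open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            total += len(chunk)
    return total


def size(path, decompress=False, trust_trailer=True):
    """Hypothetical analogue of Path.size(decompress: true)."""
    if not decompress:
        return os.path.getsize(path)
    if not is_gzip(path):
        # Behaviour proposed in the issue: warn, then report the plain size.
        warnings.warn(f"{path} is not gzip-compressed; returning its size on disk")
        return os.path.getsize(path)
    if trust_trailer:
        return isize_from_trailer(path)
    return streamed_size(path)
```

One caveat on the design: block-compressed FASTA files (bgzip) are multi-member and typically end in an empty block, so the trailer alone would not give the right answer for them. How to decide when to fall back to streaming is left open here.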
