Description
New feature
Path objects currently have a .size() method that reports the file size in bytes. It would be nice to add a decompress: true argument for gzipped files, so that the method returns the size the file would have once decompressed. If the file is not compressed in the first place, it should probably emit a warning and then return the regular file size.
This functionality is already present for other functions in the standard library, such as .countLines().
Use case
Sometimes it is useful to know the decompressed size, for example in order to set resource requirements. Creating a bwa-mem2 index from a FASTA file, for example, requires memory equal to 28x the uncompressed file size in order to complete. The command can take a gzipped file as input, though, so it would be good to be able to get the uncompressed size without first decompressing the file (which would mean storing two copies of the file in the work directory!).
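To make the arithmetic concrete, here is a small sketch in Python. The 28x factor is the one quoted above; the function name and the 3 GiB input size are hypothetical, chosen only for illustration.

```python
GIB = 1024 ** 3


def index_memory_bytes(uncompressed_bytes, factor=28):
    """Estimate the memory needed for indexing from the uncompressed input size.

    `factor` defaults to the 28x figure quoted in this issue; it is not
    taken from the bwa-mem2 documentation.
    """
    return uncompressed_bytes * factor


# Hypothetical example: a FASTA file that is 3 GiB once decompressed.
fasta_bytes = 3 * GIB
print(index_memory_bytes(fasta_bytes) / GIB)  # 84.0 (GiB)
```

The point is that the estimate has to start from the uncompressed size: using the size of the .gz file on disk would underestimate the requirement by whatever the compression ratio happens to be.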
Suggested implementation
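I don't think this has to be resource-intensive. Streaming the whole file from the head process is obviously not ideal, but it should not be necessary in most cases: gzipped files carry this information in their trailer (the last four bytes store the uncompressed size), and it can be read near instantly on the command line with gzip -l file.gz. Streaming the whole file would then only need to happen as a fallback, for example when the file is larger than 4 GiB or consists of several concatenated gzip members, where the stored value is not reliable.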
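A minimal sketch of this fast path and fallback, written in Python rather than Groovy purely for illustration. It assumes the gzip format as specified in RFC 1952, where the final four bytes of each member (the ISIZE field) hold the uncompressed size modulo 2^32 as a little-endian integer. All function names and the exact parameter are hypothetical and are not part of any existing API.

```python
import gzip
import struct
import warnings
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def isize_from_trailer(path):
    """Read ISIZE from the last 4 bytes of the file without decompressing.

    The value is the uncompressed size modulo 2**32 of the LAST member only,
    so it is wrong for inputs over 4 GiB and for multi-member files.
    """
    with open(path, "rb") as fh:
        fh.seek(-4, 2)  # 2 = seek relative to end of file
        return struct.unpack("<I", fh.read(4))[0]


def streamed_size(path, chunk_size=1 << 20):
    """Fallback: decompress the whole stream and count the bytes."""
    total = 0
    with gzip.open(path, "rb") as fh:
        while True:
            block = fh.read(chunk_size)
            if not block:
                break
            total += len(block)
    return total


def uncompressed_size(path, exact=False):
    """Hypothetical equivalent of size(decompress: true)."""
    path = Path(path)
    if not is_gzip(path):
        warnings.warn(f"{path} is not gzip-compressed; returning its size on disk")
        return path.stat().st_size
    # Cheap trailer read by default; full streaming only when requested.
    return streamed_size(path) if exact else isize_from_trailer(path)
```

The exact flag is a design choice of this sketch: the trailer read costs one seek and four bytes, while streaming costs a full decompression, so the caller decides when the cheap answer is good enough. Files produced by tools that write many concatenated members would need the streaming path to get a correct total.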