Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO::Uncompress::Unzip::getHeaderInfo() should provide (more) information in transparent mode #35

Open
tlhackque opened this issue Oct 29, 2021 · 2 comments
Assignees
Labels
documentation Documentation issue enhancement New feature or request

Comments

@tlhackque
Copy link

While I'm reporting issues with IO::Uncompress::Unzip:

When processing files that may turn out not to be compressed, I try to have a common code path.

getHeaderInfo() simply returns undef. I use the following code, which really belongs in getHeaderInfo.

     my $member = $in->getHeaderInfo();
     if( !defined $member ) {
        my $size = ( stat $filename )[7] || 0;
        $size = U64->new( int( $size / U64::HI_1 ), int( $size % U64::HI_1 ) );
        $member = { Name               => $filename,
                  Time               => ( ( stat _ )[9] ) || 0,
                  Type               => $type,
                  UncompressedLength => $size,
                  MethodName         => 'uncompressed',
                };
    }

Providing this pseudo-header metadata allows the following code to use @member->{*} to report filename, size, time, etc. (If this looks odd, in real life it's a subroutine...and $type is 'text', or if binmode has been called, data)

Note that the code in Uncompress::Base sets Name to undef and Type to 'plain', expecting a caller of getHeaderInfo to receive that - even though getHeaderInfo in fact returns undef.

I'm not especially happy about using the U64 method, but that's what getHeaderInfo returns for compressed files.

Note that what's returned for compressed files is not well documented - it should be.

This method returns either a hash reference (in scalar context) or a
list or hash references (in array context) that contains information
about each of the header fields in the compressed data stream(s).

requires a dive into the internals to figure out how to turn "information about" into specific usable data.

And it would be nice if u64::new did the math to split a 64-bit int (or on 32-bit systems, double) into the required upper and lower 1/2s.

@pmqs pmqs self-assigned this Oct 30, 2021
@pmqs pmqs added enhancement New feature or request documentation Documentation issue labels Oct 30, 2021
@pmqs
Copy link
Owner

pmqs commented Oct 30, 2021

Tidying getHeaderInfo has been on my TODO list for a while. Not been any demand to address it, until now.

Apart from the uncompressed use-case you have highlighted, it is mostly a case of documenting all the fields populated.

Agree about the use of U64 -- I really don't like it myself, but when the module was written the 32-bit OS was still predominant and I needed a way to guarantee support for 64-bit values.

I'll have a think about how best to change getHeaderInfo so that it returns native Perl inst instead/as well as the U64 objects.

@tlhackque
Copy link
Author

I understand why you have U64 - it's a quite reasonable solution. My comment was more that since it's undocumented, it's awkard for user code to peek inside to use it. The reaching inside is what I wasn't happy about.

As for alternatives:

IEEE double gives 53 significand bits - not quite 64. quad gives 113. IIRC, native Perl switches to double when it feels like it.
(And then there are non-IEEE platforms - mostly older, like VAX.) So I'm not sure you can rely on native scalars giving precise sizes with 64-bit file pointers.

There's Math::Int64 that provides 64-bit int objects no matter how Perl is compiled. And if you have XS code, it his a C API.

But a possibly better choice might be Math::BigInt - it's a bit of overkill for this application, but it is a core module (since 5.6.1 according to corelist) with both PP and XS - so avoids adding a visible dependency to your modules.

Since U64 has been undocumented, I wouldn't object to deprecating or removing it if you switch to something like that. Don't know how many other people might have done the archaeology, but switching to either of those would take minimal effort - and since both overload the usual operators, would simplify any code that touched them.

The main reason I use getHeaderInfo is for logging - it's helpful to know what was processed when looking at application performance... With large input files , it can also be helpful as the basis for sending updates to a progress meter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Documentation issue enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants