Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/gzip: return different error for trailing garbage #61797

Open
nikaiw opened this issue Aug 7, 2023 · 5 comments
Open

compress/gzip: return different error for trailing garbage #61797

nikaiw opened this issue Aug 7, 2023 · 5 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@nikaiw
Copy link

nikaiw commented Aug 7, 2023

Hello,
Regarding #47809 , would you consider returning a different err ?
"gzip: invalid header" can be quite misleading when the error is actually because of trailing garbage.

@nikaiw nikaiw added the pkgsite label Aug 7, 2023
@ianlancetaylor ianlancetaylor changed the title compress/gzip - trailing garbage compress/gzip: return different error for trailing garbage Aug 7, 2023
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed pkgsite labels Aug 7, 2023
@ianlancetaylor
Copy link
Contributor

CC @dsnet

@dsnet
Copy link
Member

dsnet commented Aug 7, 2023

According to RFC 1952, section 2.2:

A gzip file consists of a series of "members" (compressed data sets).

Thus, when a gzip member terminates, the parser should correctly check whether there is a header for another gzip member. An "invalid header" error seems to be the right error given this grammar.

@cespare
Copy link
Contributor

cespare commented Aug 7, 2023

In addition to what @dsnet said, I think it's possible to use the current API to figure out whether there might be trailing data. If z is a gzip.Reader, you can call z.Multistream(false) to have the reader read a single member at a time. Then:

  • If you know that your data always consists of a single data stream, you can read it to the end. If there's still more data available to read after that, you know that there is trailing junk.
  • If your data may consist of multiple concatenated streams, then it isn't really possible to disambiguate corrupted (say, truncated) data from trailing junk. But you could still read as many data streams as possible and then, once you hit an error, decide whether you think you've got trailing junk or a corrupted chunk, perhaps based on what the decompressed data looks like or something.

(The general ambiguity described in the second bullet is just what @dsnet pointed out; it's why the compress/gzip package isn't in a position to know whether there is trailing junk or a different kind of corruption. But you may be able to tell based on what your data looks like.)

@nikaiw
Copy link
Author

nikaiw commented Aug 8, 2023

Thanks for the details, that is making sense and I don't want to complicate the code for nothing. I must say I would still love to have a distinguished error that would allow a developper to quickly be able to make the difference between an error regarding the first gzip header and the attempt at reading another gzip member from a multistream gzip.

I imagine the value of the error could be changed to reflect that just after we manage to read the first member.

// File is ok; check if there is another.

What are your thoughts?

Edit: Alternatively, if multistream is set to true, the error could be read as something such as "Invalid header in multi-stream gzip"

@rgmz
Copy link

rgmz commented Jun 14, 2024

According to RFC 1952, section 2.2:

A gzip file consists of a series of "members" (compressed data sets).

Thus, when a gzip member terminates, the parser should correctly check whether there is a header for another gzip member. An "invalid header" error seems to be the right error given this grammar.

@dsnet While the error is technically correct, I agree with @nikaiw that a having a distinct error would provide tremendous clarity. Something like invalid header in multistream gzip would make the cause much clearer, and make the solution much easier to find.

To quote my rationale from #679861:

When using the compress/gzip package to decompress gzip files, receiving a gzip: invalid header error can indicate two distinct possibilities.

  1. The file is not a valid gzip file.

    Example test case (click to expand)
    import (
      "compress/gzip"
      "io"
      "os"
      "testing"
    )
    
    func TestNotAValidGzipFile(t *testing.T) {
      f, err := os.Open("/tmp/python-3.12.3-amd64.exe") // Arbitrary example, any non-gzip file will suffice.
      if err != nil {
          t.Fatal(err)
      }
      defer f.Close()
    
      rc, err := gzip.NewReader(f)
      if err != nil {
          t.Fatal(err)
      }
      defer rc.Close()
    
      _, err = io.ReadAll(rc)
      if err != nil {
          t.Fatal(err)
      }
    }
    
    // === RUN   TestNotAValidGzipFile
    //  gzip_test.go:19: gzip: invalid header
  2. The file is a valid gzip file, but contains invalid trailing data.

    Example test case (click to expand)
    func TestValidGzipFileWithTrailingData(t *testing.T) {
      // Reproducer file. There are many examples of this.
      // https://github.com/udacity/self-driving-car-sim/blob/4b1f739ebda9ed4920fe895ee3677bd4ccb79218/Assets/Standard%20Assets/Environment/SpeedTree/Conifer/Conifer_Desktop.spm
      f, err := os.Open("/tmp/Conifer_Desktop.spm")
      if err != nil {
        t.Fatal(err)
      }
      defer f.Close()
      
      rc, err := gzip.NewReader(f)
      if err != nil {
        t.Fatal(err)
      }
      defer rc.Close()
      
      _, err = io.ReadAll(rc)
      if err != nil {
        t.Fatal(err)
      }
    }
    
    // === RUN   TestValidGzipFileWithTrailingData
    //  gzip_test.go:19: gzip: invalid header

Scenario 2 can be especially confusing because Go's implementation of compress/gzip rejects invalid trailing data, while many popular applications and languages do not. Hence, the ambiguity of this error can lead people to believe that Go is rejecting a valid gzip file.

Footnotes

  1. GitHub's search failed me when looking for an existing issue. I'm guessing I should close that in favour of this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants