
proposal: archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy #34974

Open

saracen opened this issue Oct 17, 2019 · 40 comments

Comments

@saracen commented Oct 17, 2019

If you have a compressed file and know its uncompressed size and CRC32 checksum, it'd be nice to be able to add it to an archive as-is.

I have three current use-cases for this:

  • Repackaging a zip file, removing or adding files, without incurring the associated compression overheads for files that already exist (somewhat achieving #15626)
  • Compress first and include later based on whether the compressed size was smaller than the original, without having to perform the compression twice.
  • Support concurrent compression: compress many files concurrently and then copy the already compressed files to the archive (similar to Apache Commons Compress' ParallelScatterZipCreator)

I see three different ways we could achieve this:

  1. Create a zip.CreateHeaderRaw(fh *FileHeader) (io.Writer, error) function, that uses the FileHeader's CRC32 and UncompressedSize64 fields set by the user.
  2. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but have a new FileHeader field that indicates we're going to write already compressed data (and then use CRC32 and UncompressedSize64 fields set by the user)
  3. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.
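To make option 1 concrete, here is a rough usage sketch; CreateHeaderRaw is the proposed method, so its name and behaviour here are assumptions, not existing API:

import (
	"archive/zip"
	"io"
)

// addRaw sketches option 1: the caller already holds the compressed bytes
// plus the CRC32 and uncompressed size of the original data.
// CreateHeaderRaw is the proposed method, not part of today's archive/zip.
func addRaw(zw *zip.Writer, name string, crc uint32, uncompressedLen uint64, compressed io.Reader) error {
	w, err := zw.CreateHeaderRaw(&zip.FileHeader{
		Name:               name,
		Method:             zip.Deflate,
		CRC32:              crc,             // checksum of the uncompressed data
		UncompressedSize64: uncompressedLen, // size before compression
	})
	if err != nil {
		return err
	}
	// Bytes written here would be stored as-is; no compression applied.
	_, err = io.Copy(w, compressed)
	return err
}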

@julieqiu (Contributor) commented Oct 18, 2019

/cc @dsnet

@gopherbot commented Oct 19, 2019

Change https://golang.org/cl/202217 mentions this issue: archive/zip: support adding raw files

@saracen (Author) commented Oct 19, 2019

Added a change that uses option 1 (CreateHeaderRaw), because it feels weird to use a field in FileHeader solely for archiving when that structure is also used when reading an archive.

@ianlancetaylor (Contributor) commented Oct 19, 2019

This is going to be new API, so turning into a proposal.

@dsnet Please add any thoughts you may have about this functionality. Thanks.

@ianlancetaylor ianlancetaylor changed the title archive/zip: add already compressed files proposal: archive/zip: add already compressed files Oct 19, 2019
@gopherbot gopherbot added this to the Proposal milestone Oct 19, 2019
@gopherbot gopherbot added the Proposal label Oct 19, 2019
@rsc rsc added this to Incoming in Proposals Dec 4, 2019
@escholtz (Contributor) commented Dec 16, 2019

The project I'm working on would appreciate this functionality for performance reasons. We need to replace a single file within a zip. The DEFLATE tax is high and wasteful since we decode/encode without needing or using the intermediate data.

I like @rsc's Copy implementation in https://github.com/rsc/zipmerge. It might not be quite as flexible as the other proposals, but I think it fits well within the existing API and supports efficiently creating a new zip from an existing zip while allowing the user to replace/delete/update specific files.

// Copy copies the file f (obtained from a Reader) into w.
// It copies the compressed form directly.
func (w *Writer) Copy(f *File) error {
	dataOffset, err := f.DataOffset()
	if err != nil {
		return err
	}
	if err := w.closeLastWriter(); err != nil {
		return err
	}

	fh := f.FileHeader
	h := &header{
		FileHeader: &fh,
		offset:     uint64(w.cw.count),
	}
	fh.Flags |= 0x8 // we will write a data descriptor
	w.dir = append(w.dir, h)

	if err := writeHeader(w.cw, &fh); err != nil {
		return err
	}

	r := io.NewSectionReader(f.zipr, dataOffset, int64(f.CompressedSize64))
	if _, err := io.Copy(w.cw, r); err != nil {
		return err
	}

	return writeDesc(w.cw, &fh)
}
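For illustration, here is how repackaging might look if built on such a Copy method (a sketch; Copy is not in the standard library, and the paths and file names are made up):

import (
	"archive/zip"
	"os"
)

// repack rewrites srcPath to dstPath, omitting one file by name,
// without decompressing or recompressing anything.
func repack(srcPath, dstPath, drop string) error {
	zr, err := zip.OpenReader(srcPath)
	if err != nil {
		return err
	}
	defer zr.Close()

	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	zw := zip.NewWriter(out)
	for _, f := range zr.File {
		if f.Name == drop {
			continue // "delete" a file by simply not copying it
		}
		if err := zw.Copy(f); err != nil { // proposed method
			return err
		}
	}
	return zw.Close()
}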

Here are the latest compress/flate benchmarks from master at level 5, which archive/zip uses by default (Intel Xeon W-2135 CPU @ 3.70GHz × 6):

goos: linux
goarch: amd64
pkg: compress/flate
BenchmarkDecode/Digits/Level5/1e4-12         	    6530	    164347 ns/op	  60.85 MB/s	   40598 B/op	       7 allocs/op
BenchmarkDecode/Digits/Level5/1e5-12         	     960	   1131532 ns/op	  88.38 MB/s	   40842 B/op	      13 allocs/op
BenchmarkDecode/Digits/Level5/1e6-12         	     117	   9673619 ns/op	 103.37 MB/s	   44800 B/op	      79 allocs/op
BenchmarkDecode/Newton/Level5/1e4-12         	    8076	    164681 ns/op	  60.72 MB/s	   41114 B/op	      17 allocs/op
BenchmarkDecode/Newton/Level5/1e5-12         	    1653	    765625 ns/op	 130.61 MB/s	   44283 B/op	      31 allocs/op
BenchmarkDecode/Newton/Level5/1e6-12         	     180	   6408771 ns/op	 156.04 MB/s	   61144 B/op	     156 allocs/op
BenchmarkEncode/Digits/Level5/1e4-12         	    5710	    215617 ns/op	  46.38 MB/s
BenchmarkEncode/Digits/Level5/1e5-12         	     370	   3232511 ns/op	  30.94 MB/s
BenchmarkEncode/Digits/Level5/1e6-12         	      32	  34852520 ns/op	  28.69 MB/s
BenchmarkEncode/Newton/Level5/1e4-12         	    4774	    238334 ns/op	  41.96 MB/s
BenchmarkEncode/Newton/Level5/1e5-12         	     357	   2974790 ns/op	  33.62 MB/s
BenchmarkEncode/Newton/Level5/1e6-12         	      39	  30595764 ns/op	  32.68 MB/s
PASS
@saracen (Author) commented Dec 16, 2019

@escholtz I agree that having a (*Writer) Copy(f *File) would be nice, but as you said, it isn't quite as flexible.

For comparison, copying a file with the patch I've submitted would look something like this:

func Copy(zw *zip.Writer, zf io.ReaderAt, f *zip.File) error {
	offset, err := f.DataOffset()
	if err != nil {
		return err
	}

	w, err := zw.CreateHeaderRaw(&f.FileHeader)
	if err != nil {
		return err
	}

	_, err = io.Copy(w, io.NewSectionReader(zf, offset, int64(f.CompressedSize64)))

	return err
}

Having Copy() would be great, but if for some reason only one solution could be added, I'd prefer it to be something along the lines of CreateHeaderRaw.

For the other use cases originally mentioned, I've written https://github.com/saracen/fastzip that takes advantage of CreateHeaderRaw also (but I've had to vendor archive/zip and include my modification for it to work).

@klauspost (Contributor) commented Jan 30, 2020

Implemented in klauspost/compress#214

saracen added a commit to saracen/fastzip that referenced this issue Jun 7, 2020
klauspost kindly implemented the CreateHeaderRaw proposal
(golang/go#34974) in
klauspost/compress#214. This repository was vendoring
the patched version (https://go-review.googlesource.com/c/go/+/202217/)
but having it supported in a well maintained library already used by fastzip is
a plus.
@rsc (Contributor) commented Jun 10, 2020

I vaguely remember discussing this with @rogpeppe years ago, but I can't find the discussion. Any clearer memory, Roger?

@rsc rsc moved this from Incoming to Active in Proposals Jun 10, 2020
@rsc (Contributor) commented Jun 10, 2020

My memory is coming back. I think what I remember @rogpeppe and I talking about was the ability to modify an existing zip file by appending new files to it.

@saracen (Author) commented Jun 10, 2020

@rsc https://groups.google.com/g/golang-dev/c/ZSubqlF2G4k?pli=1 this might be the discussion, and yeah, was related to appending.

@rsc (Contributor) commented Jun 17, 2020

Adding a whole new Create method just to set this extra bit seems unfortunate - what if there is a second bit we want control over later? Add four new methods? And so on.

Probably better to add a new bool in the CreateHeader. Perhaps WriteRaw bool?

@saracen (Author) commented Jun 18, 2020

Probably better to add a new bool in the CreateHeader. Perhaps WriteRaw bool?

@rsc I originally preferred this too. The reason I didn't in my implementation is that FileHeader is also used when reading an archive. It felt odd having a WriteRaw bool as part of that structure in the context of reads. Is this not an issue?

@rsc (Contributor) commented Jun 24, 2020

@saracen it seems like less of an issue than an exponential blowup of create methods. The NonUTF8 field is probably most analogous, although it does get set in the reader.

I suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well. There at least we could add an OpenRaw method to zip.File. It's much harder to do that in what CreateHeader returns, because it's an io.Writer.

I suppose another way to store compressed files would be to register a no-op compressor.

Another way would be to have a method on zip.Writer: WriteRaw(bool) to set whether the compressors should be avoided. You'd z.WriteRaw(true) before calling CreateHeader.

Does any of these sound like a winner to anyone?

@saracen (Author) commented Jun 24, 2020

I suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well.

@rsc At the moment, I believe there are two ways to access the raw bytes:

  • Open with a ReaderAt, then use file.DataOffset() and file.CompressedSize64 to read that specific section of the archive (see the sketch below).
  • Register a decompressor and apply whatever custom logic to the raw bytes.
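A minimal sketch of the first approach, using only existing API:

import (
	"archive/zip"
	"io"
)

// rawReader returns a reader over the stored (still-compressed) bytes of f,
// where zipr is the io.ReaderAt the archive was opened from.
func rawReader(zipr io.ReaderAt, f *zip.File) (io.Reader, error) {
	offset, err := f.DataOffset() // start of the file data, past the local header
	if err != nil {
		return nil, err
	}
	return io.NewSectionReader(zipr, offset, int64(f.CompressedSize64)), nil
}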

I suppose another way to store compressed files would be to register a no-op compressor.

This sounds quite good, especially since it's similar to how you can handle raw reads by registering a decompressor.

Using a zip.NoOpCompressor would need to mean that the CompressedSize64 and CRC32 fields of the header are left unmodified when the header is closed:

zw := zip.NewWriter(archive)
zw.RegisterCompressor(zip.Deflate, zip.NoOpCompressor)

hdr := &zip.FileHeader{
  Name: "file.txt",

  // the following fields would be left untouched because
  // the writer detects that zip.NoOpCompressor is being used
  CRC32: 0xffffffff,
  CompressedSize64: 1024,
}

w, _ := zw.CreateHeader(hdr)

// bytes are copied to the zip archive unmodified
io.Copy(w, buf)
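For what it's worth, a no-op compressor is already expressible with today's RegisterCompressor; zip.NoOpCompressor above is hypothetical, but a sketch could look like:

import "io"

type nopWriteCloser struct{ io.Writer }

func (nopWriteCloser) Close() error { return nil }

// noOpCompressor satisfies zip.Compressor: it hands the archive writer
// back unchanged, so bytes are stored exactly as written.
func noOpCompressor(w io.Writer) (io.WriteCloser, error) {
	return nopWriteCloser{w}, nil
}

The catch is that today's Writer still computes CRC32 and the uncompressed size over whatever is written through the returned writer, so the "left untouched" behaviour above would also need changes inside the package.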
@rsc (Contributor) commented Jul 8, 2020

The problem with the no-op compressor is that the uncompressed size would be wrong. That would have to be left alone. It's probably a little too subtle.

Given that you can already read the compressed data directly, as @saracen points out, maybe it's fine to focus on writing.

What if pre-filling CompressedSize64 (> 0) in the FileHeader means that the writes are expected to be pre-compressed? In that case CRC32 has to be filled in too, although zero is a valid CRC32 so we can't double-check that. But I believe 0 is not a valid CompressedSize64, and in any event it's not an interesting one.

In essence, CompressedSize64 = 0 means "I don't know, you compress it for me".

Edit: I see that this was @saracen's suggestion 3 in the original comment at the top.

@rsc (Contributor) commented Jul 15, 2020

@saracen, @escholtz, @klauspost, any thoughts about the most recent suggestion (previous comment)?

@dsnet (Member) commented Jul 15, 2020

According to RFC 1951, the compression stream requires at least one block (with the BFINAL bit set). Therefore, it is impossible to have a compressed DEFLATE stream that is zero bytes in length.

However, the ZIP file format allows a number of different compression methods to be used (DEFLATE happens to be the most common by far). I think we should consider 1) whether we want to support other compression methods, and 2) whether we ever want to support another compression format that may theoretically have zero-length output for an empty input.

@escholtz (Contributor) commented Jul 15, 2020

I've been hesitant to comment because I didn't want to hijack @saracen's original proposal.

For my use case, the func (w *Writer) Copy(f *File) error implementation would be sufficient. I think it fits cleanly with the existing package exports and still maintains simplicity.

I agree that having both CreateHeader and CreateHeaderRaw does not feel like a clean standard library package.

Is there precedent in the standard library for using the CompressedSize64 approach?

@saracen (Author) commented Jul 15, 2020

For my use case, func (w *Writer) Copy(f *File) error doesn't help, but I can see its usefulness and how it relates to this problem.

@rsc In my original proposal, I'd written:

  1. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.

Breaking existing code was my only reluctance to this approach. If we don't see that as an issue, it sounds like an easy way to support this.

@escholtz (Contributor) commented Jul 16, 2020

@saracen I took another look at the three examples you originally provided and I think they could all be achieved with a Copy method?

  • Repackaging a zip file, removing or adding files, without incurring the associated compression overheads for files that already exist (somewhat achieving #15626)

Using existing zip file(s), don't Copy files you want to remove. Copy existing files you want to keep. Use existing methods to add new files.

  • Compress first and include later based on whether the compressed size was smaller than the original, without having to perform the compression twice.

Write a new zip file, read the compressed size, only Copy if it's smaller.

  • Support concurrent compression: compress many files concurrently and then copy the already compressed files to the archive (similar to Apache Commons Compress' ParallelScatterZipCreator)

Write multiple zip files concurrently and then Copy files from each into a single zip file.


Perhaps this approach isn't as elegant and has slightly more garbage/performance overhead, but it avoids the tax of repeatedly compressing the same file, which I would expect is the bottleneck in most cases.
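For the third case, a sketch of that workflow (Copy is the proposed method; everything else is today's API):

import (
	"archive/zip"
	"bytes"
)

// parallelAdd compresses each payload into its own in-memory zip
// concurrently, then stitches the results into zw via the proposed Copy.
func parallelAdd(zw *zip.Writer, files map[string][]byte) error {
	type result struct {
		buf []byte
		err error
	}
	results := make(chan result, len(files))
	for name, data := range files {
		go func(name string, data []byte) {
			var buf bytes.Buffer
			tmp := zip.NewWriter(&buf)
			w, err := tmp.Create(name)
			if err == nil {
				_, err = w.Write(data)
			}
			if cerr := tmp.Close(); err == nil {
				err = cerr
			}
			results <- result{buf.Bytes(), err}
		}(name, data)
	}
	for range files {
		res := <-results
		if res.err != nil {
			return res.err
		}
		zr, err := zip.NewReader(bytes.NewReader(res.buf), int64(len(res.buf)))
		if err != nil {
			return err
		}
		for _, f := range zr.File {
			if err := zw.Copy(f); err != nil { // proposed method
				return err
			}
		}
	}
	return nil
}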

@dsnet (Member) commented Jul 16, 2020

Breaking existing code was my only reluctance to this approach. If we don't see that as an issue, it sounds like an easy way to support this.

For all Go code inside Google, I only found one instance of someone implicitly setting CompressedSize. The usage pattern looked like the following:

rc, ... := rf.Open()
fileHeader := rf.FileHeader
wc, ... := w.CreateHeader(&fileHeader)
... io.Copy(wc, rc)

Essentially the code is copying one Zip file to another.

If we go with the semantics that a positive CompressedSize means to interpret the input as already compressed, then the pattern above will break since io.Copy will read uncompressed data from rc and write it to wc when the latter is expected compressed data.

We could "fix" this by adding a ReadFrom or WriteTo method that specially handles copying of compressed data from one zip.File to another. However, that approach seems a bit too magical and I can see other edge cases where it fails.

@rsc (Contributor) commented Jul 22, 2020

@dsnet, I don't think we have to fix this, but we could at least detect the problem: if the alg is deflate and we expected compressed data, we can at least check that the first few bytes are a proper deflate header. That would make the io.Copy return an error. But it is a little concerning.

@dsnet (Member) commented Jul 22, 2020

proper deflate header.

Unfortunately, DEFLATE doesn't really have any distinguishable magic numbers, unlike GZIP, which typically has a readily identifiable 10-byte sequence. It seems the best we could do is decompress some number of bytes and assume the input is valid if it passes that short test.
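That short test might look something like this sketch (a heuristic only; a prefix that inflates cleanly can still be a false positive):

import (
	"bytes"
	"compress/flate"
	"io"
)

// looksLikeDeflate inflates a small prefix and treats a decode error as
// evidence the data is not DEFLATE. Passing proves nothing definitive.
func looksLikeDeflate(prefix []byte) bool {
	fr := flate.NewReader(bytes.NewReader(prefix))
	defer fr.Close()
	_, err := io.CopyN(io.Discard, fr, 512)
	// Running out of prefix (EOF / unexpected EOF) is fine; only a real
	// decode error such as flate.CorruptInputError counts as failure.
	return err == nil || err == io.EOF || err == io.ErrUnexpectedEOF
}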

@rsc (Contributor) commented Aug 5, 2020

It sounds like I should look in the Go corpus I have for other code using this pattern. I'll try to do that for next week. Maybe overloading the CompressedSize here is too subtle.

@rsc (Contributor) commented Aug 12, 2020

Will try for next time.

@escholtz (Contributor) commented Sep 30, 2020

It feels like we're losing interest in this proposal and should draw a conclusion.

I don't think overloading CompressedSize is a good approach.

I still like the Copy approach that @rsc chose to use in the zipmerge package. I think all 3 use cases in the original proposal can be solved with a Copy method.

@rsc (Contributor) commented Sep 30, 2020

@escholtz, I haven't yet done the scan, but it doesn't seem like it's necessary - it's a plausible example that we'd break.

Re #34974 (comment), the one thing that would still be good to support would be if you have the header bits in hand along with the compressed form, but you don't have them in an already-existing zip file. (Maybe you want more control over the compression, or you have DEFLATE or an alternate compressed form already from a different archive file, or something like that.)

After rereading this thread now with a bit more distance, I think the fundamental operation is really just access to raw data in both reading and writing. I suggest we add those directly, along with the Copy helper:

// like CreateHeader but the data written to the io.Writer should be precompressed
func (*Writer) CreateRaw(file *FileHeader) (io.Writer, error)

// like Open but the data read back is the compressed form
func (*File) OpenRaw() (io.Reader, error)

// helper - does f.OpenRaw, w.CreateRaw, io.Copy
func (w *Writer) Copy(f *File) error 
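For clarity, the helper can be expressed in terms of the other two (a sketch inside package zip; all three methods are proposed, not existing):

// Copy copies f's compressed bytes into w without decompressing them.
func (w *Writer) Copy(f *File) error {
	r, err := f.OpenRaw()
	if err != nil {
		return err
	}
	// Reuse the source header so method, sizes, and CRC32 carry over.
	fh := f.FileHeader
	dst, err := w.CreateRaw(&fh)
	if err != nil {
		return err
	}
	_, err = io.Copy(dst, r)
	return err
}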
@escholtz (Contributor) commented Oct 1, 2020

That sounds reasonable. Looking at the package, I assumed this type of access was intentionally not exported for simplicity and reliability. If we're ok with making the package a little less simple, this would enable the advanced use cases we're trying to achieve.

@im7mortal commented Oct 3, 2020

Hi everybody! 😄
I had been thinking about this for a couple of months and eventually built my own solution. I was about to put my ideas into practice when I found this issue 😆
I had the same two problems in mind:

  1. How to copy files without decompressing and recompressing.
  2. How to precompress multiple files, or compress in parallel.

I went through several iterations. At first I had something like @saracen mentioned, with a similar-looking ReadCompres method. It was dirty and monstrous. After more thought I ended up with an ABSOLUTE copy of the Copy method from here 😆 😆 😆 I even did the same optimization with the writeDesc func. But then I had my genius moment 😄 when I discovered that both tasks can be done with the Copy method alone.

I'm happy that I can still add my penny here and perhaps resolve this discussion about the blow-up of methods.

We don't need the CreateRaw(file *FileHeader) method, and probably not OpenRaw() (io.Reader, error) either.
Both are easily handled with Copy alone.

Here is the idea:


func main() {
	data := []byte("example data") // stand-in payload
	// final writer
	zw := zip.NewWriter(bytes.NewBuffer([]byte{}))
	// get a precompressed *zip.File
	prFile, err := Precompress(data)
	if err != nil {
		log.Fatal(err)
	}
	// copy the precompressed *zip.File (Copy is the proposed method)
	if err := zw.Copy(prFile); err != nil {
		log.Fatal(err)
	}
	zw.Close()
}

func Precompress(b []byte) (*zip.File, error) {
	// create tmp zip writer
	buf := bytes.NewBuffer([]byte{})
	zw := zip.NewWriter(buf)
	// write/compress the data in
	zwf, err := zw.Create("test")
	if err != nil {
		return nil, err
	}
	if _, err := zwf.Write(b); err != nil {
		return nil, err
	}
	// Close must be called so the central directory is flushed out
	if err := zw.Close(); err != nil {
		return nil, err
	}
	// extract the file from the tmp zip
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return nil, err
	}
	return zr.File[0], nil
}

In this way all of the compression and descriptor-validation logic stays encapsulated(!), and there is (I think) no overhead: we compress the data to a buffer and then io.Copy it to the destination writer, which is the same as compressing it directly.

If a user wants to copy a couple of files from a big zip and keep them in memory, that can again be done with Copy: copy the needed files to a tmp zip, and later copy them from there to the destination zip.

I hadn't thought about the OpenRaw() method before. There aren't many use cases for it so far. If a user needs to keep only one file in a compressed state, they can use tmp zips. One use case could be a user who wants to extract a raw file and send it directly over HTTP to the client 🤔. In my first iterations I achieved this through the Compressor interface: I copied the data to a buffer and ignored the error from the CRC32 check. It worked pretty well for a long time. I will post the code here when I find it.

Conclusion

  1. The Copy method is critical.
  2. The other methods can be implemented through wrappers without overhead.
@saracen (Author) commented Oct 3, 2020

I suggest we add those directly, along with the Copy helper:

@rsc

  • I'm not sure we need OpenRaw; as mentioned in #34974 (comment), there are already a few ways to access the raw data. Although OpenRaw is probably clearer.
  • Personally, I'd prefer CreateHeaderRaw over CreateRaw; I think it's clearer that it's a variant of CreateHeader.
  • I think Copy should also be added 👍

@klauspost's compress library added both CreateHeaderRaw and Copy. Obviously, this is an external library, but it would be nice if the API continued to match.

@im7mortal Using just Copy does work and @escholtz suggested this too. It's a good solution for many problems, but it relies on either your precompressed data being in a zip file or you first copying it to one. Copying precompressed data from a different archive format would require a temporary copy to a new zip file, only to be read back again.

@rsc (Contributor) commented Oct 6, 2020

Personally, I'd prefer CreateHeaderRaw over CreateRaw, I think it's clearer that it's a variant of CreateHeader.

It seems unnecessarily long. There's no way to have CreateRaw that doesn't also take a header.

@dsnet (Member) commented Oct 6, 2020

If we plan on adding a CreateHeaderRaw, I suspect that we'll need to pay close attention to #22520. We'll want to make sure we get the semantics right from the start, otherwise they will be very hard to change later on.

@rsc (Contributor) commented Oct 6, 2020

Thanks for the pointer. I don't see why they are necessarily related.
CreateRaw should do the same as CreateHeader, just preparing for raw writes.
It shouldn't mix in other different semantics.

For that issue, I commented there but I think it would be fine to add a new field
that CreateHeader (and then by extension CreateRaw) would use.
I think that can be an independent decision though.

@im7mortal commented Oct 9, 2020

@saracen

... but it relies on either your precompressed data being in a zip file or you first copying it to one. Copying precompressed data from a different archive format would require a temporary copy to a new zip file, only to be read back again.

It's not avoidable in either case. It's just the same, except that CreateHeaderRaw discards all the encapsulated logic; instead, the user must implement it themselves.

Precompression with CreateHeaderRaw

  1. Create a CRC32 checksum wrapper.
  2. Create a byte buffer.
  3. Create a compressor writer.
  4. Write the data to the compressor writer through the CRC32 wrapper.
  5. Create a raw header with zip.CreateHeaderRaw. Ensure the CRC32 and the uncompressed size are correctly set.
  6. Copy from the buffer (2) to the writer (5).
func precompress(b []byte) (check uint32, c []byte, err error) {
	wsumm := crc32.NewIEEE() // the user must ensure a correct checksum
	_, err = wsumm.Write(b)
	if err != nil {
		return
	}
	buff := bytes.NewBuffer([]byte{})
	wcmp, err := flate.NewWriter(buff, flate.DefaultCompression) // MUST MATCH the actual compression method
	if err != nil {
		return
	}
	_, err = wcmp.Write(b)
	if err != nil {
		return
	}
	// Close flushes the remainder of the DEFLATE stream
	if err = wcmp.Close(); err != nil {
		return
	}
	return wsumm.Sum32(), buff.Bytes(), nil
}

...
	w := zip.NewWriter(writer)
	crc32_, compressed, err := precompress(data)
	if err != nil {
		return
	}
	wf, err := w.CreateHeaderRaw(&zip.FileHeader{
		Name:  "tmp",
		CRC32: crc32_, // the user must ensure the correct checksum
		// the user must ensure the correct uncompressed size; this requires
		// additional logic in the case of a pipe writer (the CRC32 too)
		UncompressedSize64: uint64(len(data)),
		Method:             zip.Deflate, // MUST MATCH the actual compression method
	})
	if err != nil {
		return
	}
	_, err = wf.Write(compressed)
	if err != nil {
		return
	}
...

Drawbacks:

  • Needs a wrapper.
  • Additional Write-Read-Write.
  • Unencapsulated: the user must implement the CRC32 sum and propagate it, together with the length, correctly.
  • Unencapsulated: the registered zip compressor is bypassed; the user can use a method different from the one set in the header.
  • Complex wrapper in the case of a pipe writer.
  • Allows a corrupted file in the archive ⚠️ Can cause crashes in receiving zip libraries.

Precompress through tmp zip

  1. Create a tmp zip writer.
  2. Write the data to the tmp zip. The CRC32 and byte counters are encapsulated.
  3. Extract the zip.File.
  4. Perform zipWriter.Copy, which does an io.Copy.

In both cases wrappers need to be written.

func PrecompressInZip(b []byte, name string, method uint16) (zf *zip.File, err error) {
	// create tmp zip writer
	buf := bytes.NewBuffer([]byte{})
	zw := zip.NewWriter(buf)
	// write/compress the data in
	zwf, err := zw.CreateHeader(&zip.FileHeader{
		Name:   name,
		Method: method,
	})
	if err != nil {
		return
	}
	if _, err = zwf.Write(b); err != nil {
		return
	}
	// Close must be called so the central directory is flushed out
	if err = zw.Close(); err != nil {
		return
	}
	// extract the file from the tmp zip
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return
	}
	return zr.File[0], nil
}
...
	// final writer
	zw := zip.NewWriter(bytes.NewBuffer([]byte{}))
	// get a precompressed *zip.File
	prFile, err := PrecompressInZip(data, "test", zip.Deflate)
	if err != nil {
		return
	}
	// copy precompressed zip.File
	zw.Copy(prFile)
...

Pros and cons:

  • Needs a wrapper.
  • Additional Write-Read-Write.
  • ⚠️ Adds directoryHeaderLen = 46 bytes of metadata (it varies, but is very small).
  • 📗 The CRC32 sum and length are encapsulated and ensured by well-tested logic.
  • 📗 The zip compressor encapsulation is fully used: the writer uses only the corresponding compressor. No way to mix them up!
  • 📗 Pipe wrappers do not require additional logic and can be handled through the standard library.
  • 📗 Does not allow a corrupted file in the archive.

Conclusion
@rsc CreateHeaderRaw is very error prone. All of the logic the zip library encapsulates must be reimplemented by the user:

  • it is not guaranteed that the compression method matches the compression implementation
  • the CRC32 hash and uncompressed size must be managed by the user, which can be complicated in the case of pipe writers
  • there is no advantage over zip.Copy from a temporary zip; it requires the same Write-Read-Write sequence
  • it allows silently writing corrupted files to the archive, which can cause crashes on the reader side in other libraries
@escholtz (Contributor) commented Oct 9, 2020

I agree with @im7mortal on this.

I think most gophers using the archive/zip package are looking for simplicity. They don't care about the specifics of the zip format, they just want to be able to read and write files.

A much smaller number of gophers are looking for a zip package that exports all of the guts that archive/zip currently hides. Building blocks to make it easier to build a custom / optimized zip package.

I think the standard library should lean more towards the simple case and not the building blocks case. I worry if we try to do both, we end up with a muddled package that is more complicated and confusing for the typical consumer.

I still don't know if Copy should be added (even though I personally want it so I don't have to maintain a fork). On the one hand, it maintains safety and doesn't add too much complexity. On the other hand, it's only necessary to support less common use cases. And it still doesn't support advanced use cases (copying from other archive formats, although I would suspect this is even more rare than copying between zips).

@saracen (Author) commented Oct 9, 2020

@im7mortal

It's not avoidable in either case. It's just the same, except that CreateHeaderRaw discards all the encapsulated logic.

There are cases where this is avoidable. You can have compressed data, know the uncompressed size and either already have the crc32 checksum of the uncompressed data or calculating it is still much faster than decompressing and recompressing.

The compressed data doesn't even need to be DEFLATE. This package already supports vendor compression methods, and there are certainly use cases where I might want to move pre-existing compressed data (say zstd) into the zip structure, over another archive format.

The whole point of CreateRaw is to give more control over this and to bypass the safety offered by the existing functionality. The zip package already allows this to a certain degree, and it's already possible to create corrupt/invalid archives accidentally.

@escholtz Regarding simplicity, this is difficult to not agree with.

Supporting vendor compression methods and arbitrary extra data are also features that not everybody will use, but they've been added. I personally don't think CreateRaw will confuse users or take away from the simplicity of the more common functions provided, but I understand your point.

@im7mortal commented Oct 13, 2020

@saracen

There are cases where this is avoidable. You can have compressed data, know the uncompressed size and either already have the crc32 checksum of the uncompressed data or calculating it is still much faster than decompressing and recompressing.

... into the zip structure, over another archive format.

I agree that there are such cases. I think we could consult #15626 (Delete, Append, Replace methods).

The most common use cases (parallel compression, precompression, copying between zips) can use Copy with negligible overhead.

@rsc (Contributor) commented Oct 14, 2020

@im7mortal, you are right that much can be done with Copy. That's why I included it in the suggestion in #34974 (comment) to add CreateRaw, OpenRaw, and Copy (all three), even though technically Copy can be implemented in terms of the other two.

In contrast, dropping the other two makes the use cases that need that low-level access fairly awkward and non-obvious.

It doesn't seem harmful to have these two important building blocks alongside the higher-level helper.

@rsc (Contributor) commented Oct 14, 2020

Does anyone object to adding CreateRaw, OpenRaw, and Copy?

@rsc rsc changed the title proposal: archive/zip: add already compressed files proposal: archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy Oct 14, 2020
@rsc (Contributor) commented Oct 21, 2020

Based on the discussion above, this seems like a likely accept.
This is subtle, though, so it would make sense to put off until Go 1.17 at this point.

@rsc rsc moved this from Active to Likely Accept in Proposals Oct 21, 2020