Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(zip): ZipOutputStream async support #574

Merged
merged 11 commits into from
Oct 9, 2021
Merged

Conversation

piksel
Copy link
Member

@piksel piksel commented Feb 14, 2021

Based on #547, but rather than duplicate the already massive zip format writing methods, it combines all the "logic" (which isn't really async anyway) and only does the actual writing using a-/sync methods.

This also moves the byte order swapping from the odd ZipHelperStream into a much more lightweight (and probably faster) ZipFormat helper class. It should also help DRYing up the shared code in ZipFile and Zip*Stream.

No actual benchmarks have been run yet, but the tests should pass.

I certify that I own, and have sufficient rights to contribute, all source code and related material intended to be compiled or integrated with the source code for the SharpZipLib open source product (the "Contribution"). My Contribution is licensed under the MIT License.

0xced and others added 3 commits December 7, 2020 16:47
This is a very rough first pass at adding async support to ZipOutputStream.

All method doing synchronous stream writes have been duplicated with an asynchronous counterpart, starting from `ZipOutputStream.PutNextEntryAsync()`. Unfortunately, this makes a lot of duplicated code but is unavoidable to have both sync + async code paths.

.NET Standard 2.1 has been added to the target frameworks. It's required to get full async support with `Stream.DisposeAsync()`.

New unit tests must be written to cover the new async code paths.

Fixes #223
@cloghead
Copy link

cloghead commented May 7, 2021

Hi, just a question: do you think async support will be merged in in the near future? It seems a bit stopped. Could I help to get it in?

@piksel piksel linked an issue May 8, 2021 that may be closed by this pull request
@piksel
Copy link
Member Author

piksel commented May 8, 2021

@cloghead It's not the easiest thing to pick up for a new project, but some code review could be nice.

This will likely be the focus for the next minor release.

@Numpsy
Copy link
Contributor

Numpsy commented May 8, 2021

No actual benchmarks have been run yet, but the tests should pass.

fwiw, I had a go at running a benchmark:

With the current master branch as of now:

|                    Method |        Job |            Toolchain |     Mean |   Error |  StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |----------- |--------------------- |---------:|--------:|--------:|------:|------:|------:|------:|----------:|
|      WriteZipOutputStream | Job-UFKFKR |             .NET 5.0 | 477.1 ms | 2.04 ms | 1.90 ms |  0.97 |     - |     - |     - |    354 KB |
|      WriteZipOutputStream | Job-IBHCHM |        .NET Core 2.1 | 500.4 ms | 0.59 ms | 0.55 ms |  1.01 |     - |     - |     - |    355 KB |
|      WriteZipOutputStream | Job-LMPLDE |        .NET Core 3.1 | 515.0 ms | 1.61 ms | 1.51 ms |  1.04 |     - |     - |     - |    356 KB |
|      WriteZipOutputStream | Job-EFLWUE | .NET Framework 4.6.1 | 493.4 ms | 1.61 ms | 1.43 ms |  1.00 |     - |     - |     - |    360 KB |
|                           |            |                      |          |         |         |       |       |       |       |           |
| WriteZipOutputStreamAsync | Job-UFKFKR |             .NET 5.0 | 477.5 ms | 3.60 ms | 3.19 ms |  0.94 |     - |     - |     - |    370 KB |
| WriteZipOutputStreamAsync | Job-IBHCHM |        .NET Core 2.1 | 500.7 ms | 0.55 ms | 0.43 ms |  0.99 |     - |     - |     - |    308 KB |
| WriteZipOutputStreamAsync | Job-LMPLDE |        .NET Core 3.1 | 511.6 ms | 1.59 ms | 1.48 ms |  1.01 |     - |     - |     - |    362 KB |
| WriteZipOutputStreamAsync | Job-EFLWUE | .NET Framework 4.6.1 | 506.2 ms | 2.69 ms | 2.52 ms |  1.00 |     - |     - |     - |    384 KB |

with this branch:

|                    Method |        Job |            Toolchain |     Mean |   Error |  StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |----------- |--------------------- |---------:|--------:|--------:|------:|------:|------:|------:|----------:|
|      WriteZipOutputStream | Job-WLVDHB |             .NET 5.0 | 489.1 ms | 8.24 ms | 7.31 ms |  0.97 |     - |     - |     - |    356 KB |
|      WriteZipOutputStream | Job-XDTUCK |        .NET Core 2.1 | 508.6 ms | 1.30 ms | 1.15 ms |  1.01 |     - |     - |     - |    354 KB |
|      WriteZipOutputStream | Job-TITBIN |        .NET Core 3.1 | 501.9 ms | 0.89 ms | 0.69 ms |  1.00 |     - |     - |     - |    356 KB |
|      WriteZipOutputStream | Job-AGIGIJ | .NET Framework 4.6.1 | 501.6 ms | 2.22 ms | 2.08 ms |  1.00 |     - |     - |     - |    360 KB |
|                           |            |                      |          |         |         |       |       |       |       |           |
| WriteZipOutputStreamAsync | Job-WLVDHB |             .NET 5.0 | 473.8 ms | 2.10 ms | 1.87 ms |  0.94 |     - |     - |     - |    363 KB |
| WriteZipOutputStreamAsync | Job-XDTUCK |        .NET Core 2.1 | 496.9 ms | 1.32 ms | 1.17 ms |  0.98 |     - |     - |     - |    308 KB |
| WriteZipOutputStreamAsync | Job-TITBIN |        .NET Core 3.1 | 523.6 ms | 2.43 ms | 2.28 ms |  1.04 |     - |     - |     - |    363 KB |
| WriteZipOutputStreamAsync | Job-AGIGIJ | .NET Framework 4.6.1 | 505.7 ms | 2.09 ms | 1.85 ms |  1.00 |     - |     - |     - |    376 KB |

which is mixed, but no massive differences (though I think this branch is missing the Array.Empty optimization that might have a small negative effect, and I have no idea why the async memory usage for .net core 2.1 is so much lower than all the others)

@piksel
Copy link
Member Author

piksel commented May 10, 2021

@Numpsy That's pretty much what I expected (not the odd memory usage ofc). The overhead that got removed was just instantiating ZipHelperStream, which I guess should be fairly easy to optimize out by the JIT. I'll do a rebase and compare the results.

@codecov
Copy link

codecov bot commented May 10, 2021

Codecov Report

Merging #574 (34901ae) into master (d34e5c9) will increase coverage by 0.50%.
The diff coverage is 88.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #574      +/-   ##
==========================================
+ Coverage   73.31%   73.82%   +0.50%     
==========================================
  Files          68       69       +1     
  Lines        8305     8291      -14     
==========================================
+ Hits         6089     6121      +32     
+ Misses       2216     2170      -46     
Impacted Files Coverage Δ
...ib/Zip/Compression/Streams/DeflaterOutputStream.cs 74.52% <46.42%> (-10.84%) ⬇️
src/ICSharpCode.SharpZipLib/Zip/ZipExtraData.cs 75.78% <81.81%> (-0.56%) ⬇️
src/ICSharpCode.SharpZipLib/Zip/ZipFormat.cs 89.54% <89.54%> (ø)
src/ICSharpCode.SharpZipLib/Zip/ZipFile.cs 77.32% <90.90%> (+0.01%) ⬆️
src/ICSharpCode.SharpZipLib/Zip/ZipOutputStream.cs 87.75% <97.22%> (-4.29%) ⬇️
src/ICSharpCode.SharpZipLib/Core/ByteOrderUtils.cs 100.00% <100.00%> (ø)
src/ICSharpCode.SharpZipLib/Core/StreamUtils.cs 46.00% <100.00%> (+6.86%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d34e5c9...34901ae. Read the comment docs.

@piksel
Copy link
Member Author

piksel commented May 10, 2021

Ran the sync benchmark against the branch after merge and against master (no async since that isn't checked in?)
It's basically the same, which makes sense, since the actual changes are only in a small fraction of the time spent during compression, writing the metadata/headers.
The gain here is that we get "true" async writing, without increasing the complexity that much, in fact, this is a step closer to combining lots of duplicated code in ZipFile and Zip*Streams. As long as the performance isn't significantly worse, the net gain is huge imo.

Common output for both tests:

$ dotnet run -c release -f netcoreapp2.1 -p benchmark/ICSharpCode.SharpZipLib.Benchmark -- -f *.Zip.*
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 7 3800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21180.15
  [Host]     : .NET Core 2.1.27 (CoreCLR 4.6.29916.01, CoreFX 4.6.29916.03), X64 RyuJIT
  Job-WADFEO : .NET Core 2.1.27 (CoreCLR 4.6.29916.01, CoreFX 4.6.29916.03), X64 RyuJIT
  Job-QPXRRC : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-LAJRXL : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT

Current master:

Branch Job Toolchain Mean Error StdDev Ratio RatioSD Allocated
master Job-MTKPDI .NET Core 2.1 553.0 ms 10.32 ms 10.14 ms 1.00 0.02 354.54 KB
master Job-ZFQRCY .NET Core 3.1 557.0 ms 10.80 ms 14.04 ms 1.02 0.03 354.49 KB
master Job-BPXFLD net461 551.9 ms 9.58 ms 8.96 ms 1.00 0.00 360.28 KB

async-ZipOutputStream:

Branch Job Toolchain Mean Error StdDev Ratio RatioSD Allocated
PR Job-WADFEO .NET Core 2.1 552.8 ms 9.03 ms 8.45 ms 1.01 0.03 354.85 KB
PR Job-QPXRRC .NET Core 3.1 565.0 ms 10.41 ms 9.74 ms 1.03 0.03 354.8 KB
PR Job-LAJRXL net461 549.3 ms 10.87 ms 13.74 ms 1.00 0.00 360.28 KB

@Numpsy
Copy link
Contributor

Numpsy commented May 10, 2021

no async since that isn't checked in

I added an extra async benchmark locally, I can send a PR to add it if it's useful

@piksel piksel self-assigned this Jun 8, 2021
@piksel piksel added this to the v1.4 milestone Jun 8, 2021
# Conflicts:
#	src/ICSharpCode.SharpZipLib/Zip/Compression/Streams/DeflaterOutputStream.cs
#	src/ICSharpCode.SharpZipLib/Zip/ZipOutputStream.cs
@guipcarvalho
Copy link

Hello guys, is this code going to be merged in the near future? I'm facing this issue where I need to create a zip out of a asynconlystream

@Numpsy
Copy link
Contributor

Numpsy commented Oct 6, 2021

I haven't had a chance to look at anything for some time and have lost track of what state the async work was in, need to remind myself what was going on :-(

@piksel
Copy link
Member Author

piksel commented Oct 6, 2021

@guipcarvalho That is the plan, but I have been busy with other projects. This should be ready to merge, I'll try to find some time to get it done this weekend.

@piksel piksel changed the title Alternative ZipOutputStream async support feat(zip): ZipOutputStream async support Oct 9, 2021
@piksel piksel merged commit e1e1a91 into master Oct 9, 2021
@piksel piksel deleted the async-ZipOutputStream branch October 9, 2021 16:55
@guipcarvalho
Copy link

Thank you @piksel!

@Numpsy
Copy link
Contributor

Numpsy commented Oct 9, 2021

I think I might have queried before, but don't recall the answer, so just in case - is there a need to check the behavior of GzipOutputStream with regards to the new functions on DeflaterOutputStream? (e.g. is it possible to call FinishAsync or DisposeAsync through a GzipOutputStream, and if so does doing so cause any problems?)

@piksel
Copy link
Member Author

piksel commented Oct 9, 2021

@Numpsy I made a test that was simply an Async version of GzipTests.SmallBufferDecompression, and it seems to work fine... 🤷‍♀️

using (var gzos = new GZipOutputStream(msGzip))
{
    gzos.IsStreamOwner = false;
    
    await gzos.WriteAsync(inputBuffer, 0, inputBuffer.Length);
    await gzos.FlushAsync();
    await gzos.FinishAsync(CancellationToken.None);
}

@Numpsy
Copy link
Contributor

Numpsy commented Oct 9, 2021

I tried it with

var content = "FileContents";

    using var ms = new MemoryStream();
    await using (var outStream = new GZipOutputStream(ms) { IsStreamOwner = false })
    {
	outStream.FileName = "/path/to/file.ext";

	var writeBuffer = Encoding.ASCII.GetBytes(content);
	outStream.Write(writeBuffer, 0, writeBuffer.Length);
	outStream.Flush();
	await outStream.FinishAsync(default);
    }

    ms.Seek(0, SeekOrigin.Begin);

    using (var inStream = new GZipInputStream(ms))
    {
	using (var sr = new StreamReader(inStream, Encoding.UTF8))
	{
		var readContent = sr.ReadToEnd();
		Assert.AreEqual(content, readContent);
		Assert.AreEqual("file.ext", inStream.GetFilename());
	}
   }

and got this error:

image

The test seems to pass if the read side is just

using (var inStream = new GZipInputStream(ms))
{
	var readBuffer = new byte[content.Length];
	inStream.Read(readBuffer, 0, readBuffer.Length);
	Assert.AreEqual(content, Encoding.ASCII.GetString(readBuffer));
	Assert.AreEqual("file.ext", inStream.GetFilename());
}

because I don't think it tries to read the gzip footer in order to read a subset of the stream data?
It also passes with the write side using the async version of Finish but with a plain using on the output stream rather than using async

@piksel
Copy link
Member Author

piksel commented Oct 10, 2021

Yeah, it's a bit weird, hence #672.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: full async/await pattern stream support
5 participants