Skip to content
This repository has been archived by the owner on Jan 26, 2020. It is now read-only.

Streaming interface #4

Closed
fasterthanlime opened this issue Nov 16, 2015 · 31 comments
Closed

Streaming interface #4

fasterthanlime opened this issue Nov 16, 2015 · 31 comments

Comments

@fasterthanlime
Copy link
Contributor

I'm currently looking into supporting BrotliDecompressStream and its compress equivalent, and I had a simple question: the README mentions the need to wrap some C++ structures, but (as of 10 days ago), BrotliDecompressStream & friends seem to be usable purely from the C interface?

It looks relatively easy to do but I'm a cgo noob so it might take me a day or two.

(unsafe.)Pointers welcome!

P.S: do tell me if this issue is unwelcome noise on the issue tracker, in which case I'll try & get it working in my corner and come back with a PR later — just thought I'd deduplicate work in case someone is already on it!

@fasterthanlime
Copy link
Contributor Author

Link to the new stream interface: https://github.com/google/brotli/blob/a6881eb3090d9d0c08e3985e6a22b7a47cecfbab/dec/decode.h#L140

function signature:

BrotliResult BrotliDecompressStream(size_t* available_in,
                                    const uint8_t** next_in,
                                    size_t* available_out,
                                    uint8_t** next_out,
                                    size_t* total_out,
                                    BrotliState* s);

@fasterthanlime
Copy link
Contributor Author

Link to example usage of the new stream interface (in BrotliDecompressBuffer): https://github.com/google/brotli/blob/c90ec29f54e9165df815b0f1311cfddb1be4afad/dec/decode.c#L1775

code for reference:

BrotliResult BrotliDecompressBuffer(size_t encoded_size,
                                    const uint8_t* encoded_buffer,
                                    size_t* decoded_size,
                                    uint8_t* decoded_buffer) {
  BrotliState s;
  BrotliResult result;
  size_t total_out = 0;
  size_t available_in = encoded_size;
  const uint8_t* next_in = encoded_buffer;
  size_t available_out = *decoded_size;
  uint8_t* next_out = decoded_buffer;
  BrotliStateInit(&s);
  result = BrotliDecompressStream(&available_in, &next_in, &available_out,
      &next_out, &total_out, &s);
  *decoded_size = total_out;
  BrotliStateCleanup(&s);
  if (result != BROTLI_RESULT_SUCCESS) {
    result = BROTLI_RESULT_ERROR;
  }
  return result;
}

@fasterthanlime
Copy link
Contributor Author

Ah, it seems there's still no C-only interface for the streaming encoder. Got it.

@kothar
Copy link
Owner

kothar commented Nov 16, 2015

Ah, I was excited there for a moment :)

I think wrapping the Encoder should be reasonably straightforward, but my C++ is a bit rusty so I've been putting it off.

@fasterthanlime
Copy link
Contributor Author

Ah, I was excited there for a moment :)

Me too! But let's get it working anyway :shipit:

I think wrapping the Encoder should be reasonably straightforward, but my C++ is a bit rusty so I've been putting it off.

Alright let's give it a go then :)

@kothar
Copy link
Owner

kothar commented Nov 16, 2015

This looks like a good starting place: http://stackoverflow.com/a/1721230/37416

@kothar
Copy link
Owner

kothar commented Nov 16, 2015

Also, finalizers are probably necessary to avoid memory leaks: https://golang.org/pkg/runtime/#SetFinalizer

@fasterthanlime
Copy link
Contributor Author

This looks like a good starting place: http://stackoverflow.com/a/1721230/37416

Yup, although this is from back in '09, possibly when go didn't build the .c/.h files found inthe same directory as a cgo-using .go file ? Should be even simpler now

Also, finalizers

👍 keeping that in mind

@fasterthanlime
Copy link
Contributor Author

Hmm might have to rename 'BrotliParams' to 'CBrotliParams' or something, otherwise encode_go.cc (which needs to include both the original cpp header + our glue c header) is going to make C++ compiler scream... I think

@fasterthanlime
Copy link
Contributor Author

Copying this here for reference — minus the Brotli In/Out classes, it uses the encoding API I'm going for (CopyInputToRingBuffer + WriteBrotliData)

int BrotliCompressWithCustomDictionary(size_t dictsize, const uint8_t* dict,
                                       BrotliParams params,
                                       BrotliIn* in, BrotliOut* out) {
  size_t in_bytes = 0;
  size_t out_bytes = 0;
  uint8_t* output;
  bool final_block = false;
  BrotliCompressor compressor(params);
  if (dictsize != 0) compressor.BrotliSetCustomDictionary(dictsize, dict);
  while (!final_block) {
    in_bytes = CopyOneBlockToRingBuffer(in, &compressor);
    final_block = in_bytes == 0 || BrotliInIsFinished(in);
    out_bytes = 0;
    if (!compressor.WriteBrotliData(final_block,
                                    /* force_flush = */ false,
                                    &out_bytes, &output)) {
      return false;
    }
    if (out_bytes > 0 && !out->Write(output, out_bytes)) {
      return false;
    }
  }
  return true;
}

@kothar
Copy link
Owner

kothar commented Nov 16, 2015

BrotliParams shoud be fine, I am already interfacing with that struct as C.struct_BrotliParams from Go.

@fasterthanlime
Copy link
Contributor Author

cf. itchio@652b90e to see what I mean — from encode_go.cc, we need to include both of these:

  • encode.h (the original, brotli C++ header)
  • encode_go.h (our C wrapper header)

And before my changes, both contained a (conflicting) declaration for BrotliParams. I changed it to CBrotliParams for now, the encode tests still runs but I haven't tested the streaming interface yet (and no finalizers yet either)

@kothar
Copy link
Owner

kothar commented Nov 16, 2015

Ah, I understand. In that case we should just factor out the part of the header that is referred to in both places and put it in a new brotli_params.h or similar file. Both places should be using the same struct, but I just constructed a new C-compatible header for interfacing with Go by copy-and-pasting the stuct definition into the new header. I'd avoid as many changes to the upstream code as possible.

Ignore me, I forgot that the original Params object was a C++ struct

@fasterthanlime
Copy link
Contributor Author

@kothar I think we can probably still do that (I definitely understand the appeal of being able to just copy-paste a bit of the brotli header and hand-modify it as little as possible), I'll definitely look into it.

edit: it's probably as easy as having encode_go.cc not import encode_go.h and just declare prototypes with the C++ version of BrotliParams

As of itchio@888315f the enc streaming interface seems to work well, still looking into finalizers + documenting it properly, then I'll work on the decoder interface I suppose

@fasterthanlime
Copy link
Contributor Author

Realizing now my streaming test isn't so good, because it only copies to ringbuffer once (4.1MB of sample data < 4.3MB ring buffer), adjusting with a smaller lgwin.

@fasterthanlime
Copy link
Contributor Author

@kothar Is there any reason why 11 is the default quality? And that it's also used in the encoding test? The upstream tests use 1, 6, 9, 11. I was thinking of using a lower quality for the streaming encode test because ~6s on my 2014 MBP is a bit of a steep price to pay for such a test :( lowering to q=2 brings the entire test down to 0.13s

@kothar
Copy link
Owner

kothar commented Nov 17, 2015

My contribution: stream decompression: #6

@kothar
Copy link
Owner

kothar commented Nov 17, 2015

Is there any reason why 11 is the default quality?

That appeared to be default in the BrotliParams initialiser, so I left it as is. Feel free to use something else for testing. I imagine most people compressing with Brotli will want something like level 9 for most stuff, or 11 if they are pre-compressing archives.

@fasterthanlime
Copy link
Contributor Author

@kothar in my limited experience with brotli compression, 9 and 10/11 are completely different order of magnitudes, so even changing to 9 would be a nice speedup imho

@fasterthanlime
Copy link
Contributor Author

Ah, having an io.Writer-based interface for the enc would be really nice as well! I'm on it

@kothar
Copy link
Owner

kothar commented Nov 17, 2015

It looks like it should be easy to wrap your compressor with Writer fairly easily - I guess you'd need a small buffer which gets flushed through the compressor when full, and a final block when the stream is closed.

@kothar
Copy link
Owner

kothar commented Nov 17, 2015

Regarding SliceHeader - I was advised never to use it! https://groups.google.com/d/msg/golang-nuts/_qu3u9HDVK8/crR7aNkRCwAJ

The advice seems to be to only allocate buffers on the Go side, and then pass them as an argument for writing to them.

@fasterthanlime
Copy link
Contributor Author

Ah, I didn't know that SliceHeader was to be avoided. Looking at encode.cc (WriteBrotliData in particular) there's no way to pass our own buffer, they allocate their own storage — so we'll have to do some buffer copying

@fasterthanlime
Copy link
Contributor Author

I guess you'd need a small buffer which gets flushed through the compressor when full, and a final block when the stream is closed.

Well the compressor has its own (ring) buffer, so we can just keep track of how much we've already copied into it, and as soon as reach the input block size (or the BrotliWriter is closed), then we call WriteBrotliData and write that to whatever we're given as output

What would be the purpose of the final block? I see encode.h exposes a FinishStream which writes a zero-length meta-block but I don't see it used anywhere. I was thinking it may be used to write multiple sequential streams with the same compressor.. but BrotliCompressor's documentation explicitly says:

// An instance can not be reused for multiple brotli streams.

@fasterthanlime
Copy link
Contributor Author

Re SliceHeader:

  • Let's not use SliceHeader then
  • However, I would really like to avoid allocating one buffer per call to WriteBrotliData
  • We can't predict how big of a buffer we're going to need (in truth, blockSize * 2 is probably more than enough... but part of why I wanted the streaming is that I want my code to run in memory-constrainted environments)
  • So, we could allocate one & resize it when it's not long enough to hold the data returned by WriteBrotliData

What do you think?

@fasterthanlime
Copy link
Contributor Author

What would be the purpose of the final block?

Answering my own question, I think the isLast parameter in WriteBrotliData means we don't have to do that, but I'm not entirely sure.

@kothar
Copy link
Owner

kothar commented Nov 17, 2015

Encode parallel uses a buffer of 2x input + 500 as a temporary output buffer for each block, and falls back to an uncompressed block if it runs out of space.

That seems to be because each input byte could theoretically be encoded as 2 bytes, plus the metablock header.

TBH, I can't think of a reason to even allocate 2x input in that case, since anything larger than the input + metablock header isn't worth compressing.

https://github.com/kothar/brotli-go/blob/master/enc/encode_parallel.cc#L154

@fasterthanlime
Copy link
Contributor Author

TBH, I can't think of a reason to even allocate 2x input in that case, since anything larger than the input + metablock header isn't worth compressing.

Right, but if I'm using brotli as part of, say, a network protocol, and for some one-in-a-thousand cases the compressed output is larger than the uncompressed version I'd still want the pipeline to remain working instead of it randomly throwing an error and me having to code a backup

@fasterthanlime
Copy link
Contributor Author

Note to self: drop the Brotli in BrotliCompressor for the enc api, as per golang packaging/naming practices

edit: Hmm but then it'd be enc.Compressor which isn't really informative... :/ Any ideas @kothar ?

@fasterthanlime
Copy link
Contributor Author

cf. #7 btw (just so the PR and the issue are cross-referenced)

@fasterthanlime
Copy link
Contributor Author

closing as PR is available

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants