Utility methods to read payload body #48

Open
sebastian-nagel opened this issue Jun 21, 2020 · 1 comment
Comments

@sebastian-nagel
Contributor

Most consumers of the content payload require the payload to be

  1. decoded using the provided HTTP Content-Encoding
  2. available as byte[] (e.g. for Tika) or even as a String (e.g. for Jsoup)

I've found myself writing similar code in several places when consuming the payload body of WarcResponse records: jwarc's extract tool (#41), a sitemap tester and StormCrawler. To make jwarc more usable, I propose bundling the following functionality into a few utility methods:

  • return the payload body as a channel, decoded according to the HTTP Content-Encoding
    • with configurable behavior (fail or return payload without decoding) when Content-Encoding isn't understood or is not reliable (gzip without gzip magic/header)
    • possibly allow passing in decoders for encodings not supported by jwarc, e.g. brotli (I assume that jwarc is designed to have zero dependencies)
    • or should the decoding functionality be provided in a class HttpPayload extending WarcPayload?
  • read the (decoded) payload into byte[] (or ByteBuffer)
    • optionally limit the maximum size of the byte[] so that oversized captures do not cause problems such as running out of memory
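
To make the first bullet concrete, here is a minimal sketch of the "fail or return the payload without decoding" behavior. It uses only JDK classes and an InputStream rather than a channel for brevity. The class name PayloadDecodingSketch, the OnFailure enum and the decode signature are hypothetical names chosen for illustration; none of them are existing jwarc API.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only, not part of jwarc.
class PayloadDecodingSketch {

    /** What to do when the Content-Encoding is unknown or looks unreliable. */
    enum OnFailure { FAIL, RETURN_RAW }

    static InputStream decode(InputStream body, String contentEncoding, OnFailure onFailure)
            throws IOException {
        String encoding = contentEncoding == null
                ? "" : contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (encoding.isEmpty() || encoding.equals("identity")) {
            return body; // nothing to decode
        }
        if (encoding.equals("gzip") || encoding.equals("x-gzip")) {
            // Peek at the first two bytes to verify the gzip magic number 0x1f 0x8b.
            BufferedInputStream in = new BufferedInputStream(body);
            in.mark(2);
            int b1 = in.read();
            int b2 = in.read();
            in.reset();
            if (b1 == 0x1f && b2 == 0x8b) {
                return new GZIPInputStream(in);
            }
            if (onFailure == OnFailure.FAIL) {
                throw new IOException("Content-Encoding is gzip but the gzip magic bytes are missing");
            }
            return in; // header claims gzip, body is not: hand back the raw bytes
        }
        if (onFailure == OnFailure.FAIL) {
            throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
        return body; // unknown encoding: hand back the raw bytes
    }
}
```

With OnFailure.RETURN_RAW a caller always gets a readable stream and can decide later whether the bytes are usable. With OnFailure.FAIL the mismatch surfaces as an IOException.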
@ato
Member

ato commented Jun 22, 2020

Having something like a decode() or bodyDecoded() convenience method on both HttpMessage and WarcPayload that decodes the content encoding seems reasonable to me.

record.payload().decode() -> MessageBody?
response.http().decode() -> MessageBody?

I think we could make brotli an optional Maven dependency and use it if it's present on the classpath.
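
A rough sketch of how such a classpath check could look, using reflection so that nothing brotli-related is needed at compile time. The class name org.brotli.dec.BrotliInputStream (from the org.brotli:dec artifact) and its single-argument InputStream constructor are assumptions here, and OptionalBrotliSketch is a hypothetical name, not jwarc code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Constructor;

// Hypothetical helper for illustration only, not part of jwarc.
class OptionalBrotliSketch {

    // Assumption: the optional dependency provides this class.
    static final String BROTLI_CLASS = "org.brotli.dec.BrotliInputStream";

    /** Returns true if the named class can be found without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OptionalBrotliSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Wraps the stream in a brotli decoder if one is available, else fails. */
    static InputStream brotliDecode(InputStream in) throws IOException {
        if (!isOnClasspath(BROTLI_CLASS)) {
            throw new IOException("No brotli decoder on the classpath");
        }
        try {
            Constructor<?> ctor = Class.forName(BROTLI_CLASS).getConstructor(InputStream.class);
            return (InputStream) ctor.newInstance(in);
        } catch (ReflectiveOperationException e) {
            throw new IOException("Could not create brotli decoder", e);
        }
    }
}
```

The reflective lookup keeps the core library free of a hard dependency, at the cost of losing compile-time checking of the decoder's constructor.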

> read the (decoded) payload into byte[] (or ByteBuffer)

Note that from Java 9 you can do body().stream().readAllBytes() and body().stream().readNBytes(buf, off, len). I'm not opposed to having our own, though, as there are still quite a few people targeting Java 8.
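
For Java 8 targets, a size-limited read could be sketched roughly as below. BoundedReadSketch and readBytes are hypothetical names, and silently truncating at the limit is just one possible policy; throwing once the limit is exceeded would be another.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only, not part of jwarc.
class BoundedReadSketch {

    /** Reads at most maxBytes from the stream; anything beyond the limit is left unread. */
    static byte[] readBytes(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

The result can be passed straight to a consumer that expects a byte[], or turned into a String once the charset is known.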
