Tool to extract a WARC record (or its headers or payload) #41

Merged: 1 commit, May 29, 2020

sebastian-nagel (Contributor)

Extract a WARC record given the record offset, inspired by warcio's extract tool.

  • the extracted payload is decoded (transfer and content encoding); for now, only "gzip" is supported as Content-Encoding
  • minor glitch: the order of headers is not preserved when they're taken from WarcRecord; they're shown in the lexical order defined by the TreeMap that holds the headers internally.
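
For readers unfamiliar with offset-based extraction, the core idea can be sketched with the JDK alone. This is not the code added in this PR: `ExtractByOffsetSketch` and its methods are hypothetical names, and the sketch assumes an uncompressed WARC file whose record declares a Content-Length header. It does not handle per-record gzip compression or payload decoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not the jwarc implementation.
final class ExtractByOffsetSketch {

    /** Returns the header block plus body of the record that starts at the given byte offset. */
    static byte[] extractRecord(Path warc, long offset) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(warc)) {
            channel.position(offset);
            ByteArrayOutputStream record = new ByteArrayOutputStream();

            // Read byte by byte until the blank line (CRLF CRLF) that ends the header block.
            byte[] terminator = {'\r', '\n', '\r', '\n'};
            ByteBuffer one = ByteBuffer.allocate(1);
            int matched = 0; // number of terminator bytes matched so far
            while (matched < terminator.length) {
                one.clear();
                if (channel.read(one) != 1) {
                    throw new IOException("EOF inside record headers");
                }
                byte b = one.get(0);
                record.write(b);
                matched = (b == terminator[matched]) ? matched + 1 : (b == '\r' ? 1 : 0);
            }

            // The body length comes from the Content-Length header (assumed present).
            String headers = new String(record.toByteArray(), StandardCharsets.UTF_8);
            ByteBuffer body = ByteBuffer.allocate(Math.toIntExact(contentLength(headers)));
            while (body.hasRemaining()) {
                if (channel.read(body) < 0) {
                    throw new IOException("EOF inside record body");
                }
            }
            record.write(body.array(), 0, body.position());
            return record.toByteArray();
        }
    }

    /** Finds Content-Length in a CRLF-separated header block, ignoring case. */
    static long contentLength(String headerBlock) {
        for (String line : headerBlock.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                return Long.parseLong(line.substring(colon + 1).trim());
            }
        }
        throw new IllegalArgumentException("no Content-Length header");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ExtractByOffsetSketch <file.warc> <offset>");
            return;
        }
        System.out.write(extractRecord(Paths.get(args[0]), Long.parseLong(args[1])));
        System.out.flush();
    }
}
```

Because the sketch copies the header bytes as they appear in the file, it sidesteps the header-order glitch mentioned above, at the cost of doing no parsing or decoding at all.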

ato (Member) commented May 29, 2020

Nice!

> the order of headers is not preserved when they're taken from WarcRecord

We also remove excess surrounding whitespace, unfold headers, and, if there are duplicate header field names that differ only in case (WARC-CONCURRENT-TO, warc-concurrent-to), keep only one variant. Maybe the parser should keep a copy of the raw header bytes for use cases where you want to copy or display the raw header unmodified.
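
To make these effects concrete, here is a rough sketch of a parser that normalizes headers in the ways listed above while also keeping a raw copy. It is an illustration only, not jwarc's parser: `HeaderNormalizationSketch` is a hypothetical class, and the case-insensitive `TreeMap` is an assumption about how the lexical ordering and the merging of case variants could arise.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the jwarc parser.
final class HeaderNormalizationSketch {
    final byte[] raw;                       // untouched copy, for display or copying
    final Map<String, List<String>> parsed; // normalized view

    HeaderNormalizationSketch(String rawBlock) {
        this.raw = rawBlock.getBytes(StandardCharsets.ISO_8859_1);
        this.parsed = parse(rawBlock);
    }

    static Map<String, List<String>> parse(String rawBlock) {
        // Case-insensitive sorted map: names differing only in case share one key,
        // and iteration order is lexical rather than the order in the file.
        Map<String, List<String>> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        // Unfold: a line break followed by spaces or tabs continues the previous line.
        String unfolded = rawBlock.replaceAll("\r\n[ \t]+", " ");
        for (String line : unfolded.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "name: value" line
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim(); // drop surrounding whitespace
            headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "WARC-Type: response\r\n"
                + "WARC-CONCURRENT-TO:   <urn:uuid:a>  \r\n"
                + "warc-concurrent-to: <urn:uuid:b>\r\n"
                + "Content-Type: application/http;\r\n msgtype=response\r\n";
        HeaderNormalizationSketch h = new HeaderNormalizationSketch(raw);
        System.out.println(h.parsed);
        System.out.print(new String(h.raw, StandardCharsets.ISO_8859_1));
    }
}
```

In this sketch the parsed view lists Content-Type first and keeps only the first spelling of the concurrent-to name, whereas the `raw` field still reproduces the input byte for byte.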

ato merged commit bcaad8b into iipc:master on May 29, 2020
ato mentioned this pull request on May 29, 2020
ato added a commit that referenced this pull request Jun 23, 2020
New features

* New tool to extract a WARC record, headers or payload #41 #47 (Sebastian Nagel)
* Improved logging of MediaType parse errors #43 (Sebastian Nagel)

Bugs fixed

* Lenient HTTP parser now accepts header names that are empty or contain invalid characters #51
* GzipChannel.write() now returns the number of consumed bytes instead of compressed bytes written #46 (Sebastian Nagel)
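
The GzipChannel.write() entry above is easier to follow with the `WritableByteChannel` contract in mind: callers typically loop until the source buffer is drained and read the return value as the number of bytes taken from that buffer. The following is a minimal sketch of a compressing channel that honours this. It is not jwarc's GzipChannel; `CountingGzipChannel` is a hypothetical class built on the JDK's `GZIPOutputStream`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not jwarc's GzipChannel.
final class CountingGzipChannel implements WritableByteChannel {
    private final GZIPOutputStream gzip;
    private boolean open = true;

    CountingGzipChannel(OutputStream sink) throws IOException {
        this.gzip = new GZIPOutputStream(sink);
    }

    @Override
    public int write(ByteBuffer src) throws IOException {
        int consumed = src.remaining();
        byte[] chunk = new byte[consumed];
        src.get(chunk);   // advances src.position() by `consumed`
        gzip.write(chunk);
        // Report input bytes taken from src, not compressed bytes emitted to the sink.
        return consumed;
    }

    @Override
    public boolean isOpen() {
        return open;
    }

    @Override
    public void close() throws IOException {
        open = false;
        gzip.close(); // flushes remaining compressed data and writes the gzip trailer
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ByteBuffer src = ByteBuffer.wrap(new byte[1000]);
        int written;
        try (CountingGzipChannel channel = new CountingGzipChannel(sink)) {
            written = channel.write(src);
        }
        System.out.println("consumed=" + written + ", compressed=" + sink.size());
    }
}
```

If write() returned the compressed size instead, the return value would disagree with how far the buffer position had advanced, and callers that track progress by the return value would miscount.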