Archive Method: Add WARC file output #6
Comments
See: #11
pirate closed this May 16, 2017
eqyiel commented Jan 1, 2018
Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing: https://en.wikipedia.org/wiki/Webarchive. WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC
Oh I didn't know they were different, thanks @eqyiel.
pirate reopened this Jan 1, 2018
pirate added the "help wanted" label Jan 7, 2018
pirate changed the title from "Please add WARC format" to "Archive Method: Add WARC file output" Jan 7, 2018
pirate added this to the v0.0.4 milestone Jan 9, 2018
Just requires adding a new config
pirate added the "complexity: medium" label Jan 10, 2018
pirate added the "changes: config" label Apr 6, 2018
anarcat commented Sep 13, 2018
For the record, pywb has trouble reading wget WARC files because wget's output is non-standard: webrecorder/pywb#294. You might want to consider another crawler for the task, or wait for wget to fix this first.
The bookmark-archiver is mentioned in a recent LWN article: WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. Wget is, according to that article, a little broken in terms of WARC output, but there is a list of alternatives which work fine. It would be very nice if bookmark-archiver got support for WARC archives.
I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal. I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler and a headless browser to cover all the edge cases around JS execution.
pirate removed the "complexity: medium" label Nov 23, 2018
The easiest way to do this might be to just point the headless browser you already run to a WARC recording proxy. However, brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer term goal. EDIT: just noticed there's #113 for brozzler already.
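To make the proxy idea above concrete, here is a rough Python sketch of how a headless Chrome invocation could be pointed at a recording proxy. It is only an illustration, not what bookmark-archiver does: the function name, the `chromium-browser` binary name, and the `localhost:8000` proxy address are all assumptions, and the proxy itself (whichever WARC recorder is used) has to be started separately. `--headless`, `--proxy-server`, `--ignore-certificate-errors` and `--dump-dom` are standard Chrome switches.

```python
import shlex


def build_chrome_proxy_command(url, proxy_addr, chrome_binary="chromium-browser"):
    """Build an argv list that loads `url` in headless Chrome through a proxy.

    `proxy_addr` is the host:port of an already-running WARC recording proxy
    (hypothetical here; nothing in this sketch starts one).
    """
    return [
        chrome_binary,                     # assumed binary name, varies by system
        "--headless",                      # run without a visible window
        f"--proxy-server={proxy_addr}",    # route all requests through the recorder
        "--ignore-certificate-errors",     # a MITM recording proxy re-signs HTTPS traffic
        "--dump-dom",                      # print the rendered DOM, then exit
        url,
    ]


if __name__ == "__main__":
    cmd = build_chrome_proxy_command("https://example.com", "localhost:8000")
    # Print the command instead of running it, since the proxy is hypothetical.
    print(shlex.join(cmd))
```

Because the browser executes the page's JavaScript, requests made dynamically would pass through the proxy too, which is the main appeal over a plain wget crawl.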
pirate removed the "good first issue" label Dec 7, 2018
Wget WARC file output is now supported in e8808b0, via the [...]. I'll investigate creating a more complete WARC that includes dynamic requests with the chrome proxy, but the simpler wget one is worth testing in the interim.
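For anyone wanting to try wget's WARC output by hand, a minimal sketch of such a command follows. This is not the command from e8808b0: the function name, output directory, and file prefix are made up for illustration. The wget options themselves (`--warc-file`, `--page-requisites`, `--directory-prefix`) are real, and by default wget appends `.warc.gz` to the given prefix.

```python
import shlex
from pathlib import Path


def build_wget_warc_command(url, out_dir, warc_name="archive"):
    """Build an argv list that fetches `url` and records the traffic to a WARC.

    `out_dir` and `warc_name` are hypothetical names chosen for this sketch.
    """
    warc_prefix = Path(out_dir) / warc_name
    return [
        "wget",
        "--page-requisites",               # also fetch images, CSS, and scripts
        f"--warc-file={warc_prefix}",      # wget writes <prefix>.warc.gz
        f"--directory-prefix={out_dir}",   # where the downloaded files go
        url,
    ]


if __name__ == "__main__":
    cmd = build_wget_warc_command("https://example.com", "output")
    # Print rather than execute, so the sketch has no network side effects.
    print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting problems with unusual URLs.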
pirate closed this Jan 11, 2019
WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?
Yeah, actually I think I'll turn it off by default for now because it's too minimal to be a default until we get [...]. Here's the new issue to track the all-in-one WARC file: #130
bmix commented May 5, 2017
An overview of existing projects to consume them is here:
http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#WARC_viewer