Archive Method: Add WARC file output #6

Closed
bmix opened this issue May 5, 2017 · 11 comments

@bmix

commented May 5, 2017

An overview of existing projects that consume WARC files is here:
http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#WARC_viewer

@pirate

Owner

commented May 16, 2017

See: #11

@pirate pirate closed this May 16, 2017

@eqyiel

commented Jan 1, 2018

Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing.

https://en.wikipedia.org/wiki/Webarchive
https://en.wikipedia.org/wiki/Web_ARChive

WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC

@pirate

Owner

commented Jan 1, 2018

Oh I didn't know they were different, thanks @eqyiel.

@pirate pirate reopened this Jan 1, 2018

@pirate pirate added the help wanted label Jan 7, 2018

@pirate pirate changed the title Please add WARC format Archive Method: Add WARC file output Jan 7, 2018

@pirate pirate added this to the v0.0.4 milestone Jan 9, 2018

@pirate

Owner

commented Jan 9, 2018

This just requires adding a new FETCH_WARC config option and archive_method.fetch_warc:

https://www.archiveteam.org/index.php/Wget_with_WARC_output
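
A minimal sketch of what such an option and method could look like, assuming the archiver shells out to wget. The helper names `warc_enabled` and `build_wget_warc_cmd`, the `archive` file prefix, and the off-by-default fallback are illustrative assumptions, not the project's actual code. `--warc-file`, `--page-requisites`, and `--no-verbose` are existing wget flags.

```python
import os
import subprocess
from typing import List, Mapping, Optional


def warc_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the FETCH_WARC option from the environment (default here is an assumption)."""
    env = os.environ if env is None else env
    return env.get("FETCH_WARC", "False").strip().lower() in ("1", "true", "yes")


def build_wget_warc_cmd(url: str, warc_prefix: str = "archive") -> List[str]:
    """Build a wget command that records the fetch into <warc_prefix>.warc.gz."""
    return [
        "wget",
        "--no-verbose",
        "--page-requisites",           # also fetch images, CSS, and scripts
        f"--warc-file={warc_prefix}",  # wget appends the .warc.gz suffix itself
        url,
    ]


def fetch_warc(url: str, out_dir: str, timeout: int = 60,
               env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Run wget inside out_dir and return the WARC path, or None if disabled or missing."""
    if not warc_enabled(env):
        return None
    os.makedirs(out_dir, exist_ok=True)
    # wget can exit non-zero when a single requisite fails, so the exit code is not checked.
    subprocess.run(build_wget_warc_cmd(url), cwd=out_dir, timeout=timeout, check=False)
    warc_path = os.path.join(out_dir, "archive.warc.gz")
    return warc_path if os.path.exists(warc_path) else None
```

Keeping the command builder separate from the subprocess call makes the flag handling easy to test without touching the network.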

@anarcat

commented Sep 13, 2018

For the record, pywb has trouble reading wget's WARC files because wget's output is non-standard: webrecorder/pywb#294. You might want to consider another crawler for the task, or wait until wget fixes this first.

@f0086

Contributor

commented Nov 23, 2018

The bookmark-archiver is mentioned in a recent LWN article:
https://lwn.net/Articles/766374/

WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. Wget is, according to that article, a little broken in terms of WARC output, but there is a list of alternatives which work fine.

It would be very nice if bookmark-archiver got support for WARC archives.

@pirate

Owner

commented Nov 23, 2018

I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal.

I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler and a headless browser to cover all the edge cases around JS execution.

@FiloSottile

Contributor

commented Dec 1, 2018

The easiest way to do this might be to just point the headless browser you already run at a WARC recording proxy. However, brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer-term goal.

EDIT: just noticed there's #113 for brozzler already.
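
A rough sketch of the recording-proxy approach, assuming headless Chrome/Chromium and a proxy such as warcprox already listening on `localhost:8000`. The binary name, the proxy address, and the helper names are assumptions for illustration. `--headless`, `--proxy-server`, `--ignore-certificate-errors`, and `--dump-dom` are existing Chrome flags.

```python
import subprocess
from typing import List


def build_chrome_cmd(url: str,
                     proxy: str = "localhost:8000",            # assumed proxy address
                     chrome_binary: str = "chromium-browser"   # assumed binary name
                     ) -> List[str]:
    """Build a headless Chrome command whose traffic goes through a recording proxy."""
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        f"--proxy-server={proxy}",
        # A recording proxy re-signs HTTPS traffic with its own CA, so the browser
        # must either trust that CA or skip certificate checks as done here.
        "--ignore-certificate-errors",
        "--dump-dom",
        url,
    ]


def record_via_proxy(url: str, timeout: int = 60, **kwargs: str) -> int:
    """Load the page once so the proxy can write every request and response to its WARC."""
    result = subprocess.run(
        build_chrome_cmd(url, **kwargs),
        stdout=subprocess.DEVNULL,  # the DOM dump is discarded; the proxy keeps the record
        timeout=timeout,
    )
    return result.returncode
```

In this setup the browser writes nothing itself; the WARC is produced by the proxy as a side effect of the page load, which is why requests made by JavaScript end up in the archive too.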

@pirate pirate removed the good first issue label Dec 7, 2018

@pirate

Owner

commented Jan 11, 2019

Wget WARC file output is now supported in e8808b0, via the FETCH_WARC=True flag (on by default).

I'll investigate creating another, more complete WARC that includes dynamic requests by using the Chrome proxy, but in the interim the simpler wget one is worth testing.

@pirate pirate closed this Jan 11, 2019

@FiloSottile

Contributor

commented Jan 13, 2019

WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?
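
One way to check what a given WARC actually bundles is to tally its records by type and target URI. The sketch below is a deliberately small standard-library parser written for illustration (the names `iter_warc_headers` and `summarize_warc` are made up here); it assumes well-formed records with a `Content-Length` header and is not a substitute for a real WARC library.

```python
import gzip
import io
from collections import Counter
from typing import BinaryIO, Dict, Iterator


def iter_warc_headers(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield the header fields (lowercased names) of each record in a WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                     # end of file
        if not line.startswith(b"WARC/"):
            continue                   # skip the blank lines between records
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body so the next readline lands on the record separator.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def summarize_warc(path: str) -> Counter:
    """Count records by WARC-Type in a .warc or .warc.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return Counter(h.get("warc-type", "unknown") for h in iter_warc_headers(stream))
```

A file holding a single `response` record for the page URL contains only the HTML, while one with many `response` records for distinct `WARC-Target-URI` values also captured the page's resources.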

@pirate

Owner

commented Jan 13, 2019

Yeah, actually I think I'll turn it off by default for now, because it's too minimal to be a default until we get warcprox working with headless Chrome.

Here's the new issue to track the all-in-one WARC file #130
