Support indexing WACZ files #710

Open
machawk1 opened this issue Aug 13, 2020 · 8 comments · May be fixed by #770

Comments

@machawk1
Member

machawk1 commented Aug 13, 2020

Via @ikreymer, Web Archive Collection Zipped (WACZ) Format, https://github.com/webrecorder/wacz-format (MIT, potentially reusable)

Example of MDN WACZ at https://twitter.com/webrecorder_io/status/1293730279824089088

https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/mdn.wacz (1.6GB)

Finalizing Issue #604 (resolving #631) would help here, depending on the WACZ's contents. Also, hosting larger WARCs remotely like this, because they exceed GitHub's size restrictions, could serve as a means of testing scalability.

@machawk1
Member Author

A WACZ file can be interpreted as a ZIP file with a defined structure. The files that ipwb targets (WARCs) are in /archive. Thus, the WACZ file should be read and interpreted as a ZIP, the WARC files in /archive extracted, and those files sent to the ipwb indexer.

In the future, we may want to consider the additional context that WACZ provides.

Sample WACZ https://play.archipelago.nyc/do/10/iiif/3546d9bd-a25c-4ba1-b96f-29411c0d752a/full/full/0/etd.wacz
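
The extraction step described in this comment could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not ipwb's actual implementation: the function name extract_warcs and its out_dir parameter are hypothetical, and it assumes the WARCs sit under an archive/ prefix inside the ZIP with a .warc or .warc.gz extension.

```python
import shutil
import zipfile
from pathlib import Path


def extract_warcs(wacz_path, out_dir):
    """Copy WARC files found under archive/ in a WACZ (ZIP) to out_dir.

    Hypothetical helper for illustration; returns the extracted paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for info in wacz.infolist():
            name = info.filename
            # Assumption: WARCs live under the archive/ prefix
            if info.is_dir() or not name.startswith("archive/"):
                continue
            if not name.endswith((".warc", ".warc.gz")):
                continue
            # Write using only the base name to avoid nested paths
            target = out_dir / Path(name).name
            with wacz.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

The returned paths could then be handed to the indexer one at a time; how ipwb consumes them is outside the scope of this sketch.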

@machawk1 machawk1 self-assigned this May 17, 2022
@machawk1
Member Author

Preliminary support added in 779978a. WACZ detection should be improved, but importing py-wacz incurs other dependencies due to pywb coupling.

@machawk1
Member Author

Also, is_zipfile() fails on WACZ files due to the magic number (signature) not matching that of a ZIP file.
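
One way to investigate a report like this is to compare a file's leading bytes with the usual ZIP local file header signature (PK\x03\x04) alongside what zipfile.is_zipfile() returns. The sketch below is only a diagnostic aid under that assumption; the helper name describe_zip_candidate is hypothetical and not part of ipwb or py-wacz.

```python
import zipfile

# Signature that begins a typical non-empty ZIP file (local file header)
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def describe_zip_candidate(path):
    """Report the first four bytes of a file and two ZIP checks on it."""
    with open(path, "rb") as f:
        leading = f.read(4)
    return {
        "leading_bytes": leading,
        "starts_with_zip_signature": leading == ZIP_LOCAL_HEADER,
        "is_zipfile": zipfile.is_zipfile(path),
    }
```

Running this against a WACZ that fails detection shows whether the two checks disagree, which may help narrow down where the mismatch comes from.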

machawk1 added a commit that referenced this issue May 17, 2022
@machawk1
Member Author

In 9436999, I created a wacz using:

wacz create -o ./samples/wacz/my-collection.wacz ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

...which produces a 79 KB file. Attempting to replay this in https://replayweb.page/ shows no URLs in the interface.

@ikreymer

ikreymer commented May 17, 2022

wacz create -o ./samples/wacz/my-collection.wacz ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

The command should now include -f before the WARC files (and should probably have better argument validation).
Try:

 wacz create -o ./samples/wacz/my-collection.wacz -f ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

@machawk1
Member Author

machawk1 commented May 17, 2022

@ikreymer Thanks for your proactive feedback here. I ran:

wacz create -o ./samples/wacz/my-collection.wacz -f ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

and a 79 KB file, my-collection.wacz.zip (.zip added only for GitHub upload), was generated. This WACZ does not cause any URLs to be recognized in replayweb.page.

wacz 0.4.6 installed via pypi, macOS 12.3.1, Python 3.10.4

- -

EDIT: When decompressing the WACZ, the WARCs are present. Perhaps pywb is having an issue replaying them -- they were not created w/ the webrecorder stack.

EDIT2: Uploading the WARCs directly to replayweb.page produces the same result -- no URL is shown in the interface. A next step will be to try these WARCs in pywb directly to see if any errors are reported.

EDIT3: warcio seems to work ok with these WARCs, for example:

from warcio.archiveiterator import ArchiveIterator

# Print the target URI of every response record in the WARC
with open('./samples/warcs/5mementos.warc', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

produces:

http://memento.us/
http://memento.us/
http://memento.us/
http://memento.us/
http://memento.us/
http://someotherURI.us/
http://anothersite.us/

@machawk1
Member Author

Base test added in 25e91ad but GH Action is reporting service issues.

machawk1 added a commit that referenced this issue May 25, 2022
@machawk1 machawk1 linked a pull request Jun 27, 2022 that will close this issue
1 task
@machawk1
Member Author

machawk1 commented Feb 7, 2023

Per a discussion w/ Mark G. @ IA, WACZ is supported at web-beta.archive.org/save for those with a "beta" account (which I have).

@machawk1 machawk1 linked a pull request Apr 10, 2023 that will close this issue
1 task