
Add WACZ support #770

Draft · wants to merge 13 commits into base: master
Conversation

machawk1 (Member) commented May 24, 2022

  • Clean up the directories created for WARC files extracted from a WACZ once indexing completes. The files are deleted, but the directories remain, and the docs say the creator is responsible for removing them.

See the additional TODOs below before this PR can be merged.

@machawk1 machawk1 marked this pull request as draft May 24, 2022 20:03
@machawk1 machawk1 requested a review from ibnesayeed May 24, 2022 20:03
machawk1 (Member, Author)

This is still pending a passing test and probably some better documentation, but I would appreciate your feedback on the diff here, @ibnesayeed.

@machawk1 machawk1 marked this pull request as ready for review May 25, 2022 20:15
ibnesayeed (Member) left a comment

I would prefer that we handle this by extracting WACZ files externally and then indexing those WARCs the usual way until we have a built-in WARC record iterator in py-wacz (an idea I have already discussed with @ikreymer), at which point this change will become obsolete.

That said, I do not have any strong objections against this PR. I have added some inline comments for potential improvements though.

Resolved inline comments on ipwb/util.py (outdated), ipwb/util.py, and ipwb/indexer.py.
@machawk1 machawk1 requested a review from ibnesayeed June 2, 2022 21:04
@@ -171,6 +186,8 @@ def index_file_at(warc_paths, encryption_key=None,
    cdxj_metadata_lines = generate_cdxj_metadata(cdxj_lines)
    cdxj_lines = cdxj_metadata_lines + cdxj_lines

    cleanup_warc_files_extracted_from_wacz(warc_paths_to_append)
Member

Not a big issue, but I think the temporary folders created by the mkdtemp() call will continue to exist (until cleaned up by the OS) because only the files inside them are deleted, not the folders themselves.
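
For illustration (not code from this PR), a minimal sketch of removing the whole temporary directory rather than only the files inside it:

import shutil
import tempfile

# Create a temporary directory for the extracted WARCs
# (extraction_dir is a hypothetical name used for illustration).
extraction_dir = tempfile.mkdtemp(prefix='ipwb-wacz-')

# ... extract WARCs into extraction_dir and index them ...

# Remove the directory and everything still inside it in one call.
shutil.rmtree(extraction_dir, ignore_errors=True)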

Member Author

The docs say that the creator is responsible for the deletion, so I think we should handle this. Given that each WARC gets a new temp directory, it might be better to retain the directory path and delete it along with its contents, instead of deleting the WARC and then the directory, which would require tracking the directory path, too. Which approach would you rather we implement, @ibnesayeed?

Member

which would require tracking the directory path, too

Not really! It is possible to get the path of the directory from the path of a file it contains.

That said, I would perhaps prefer not holding onto a list of WARC files and instead operating on each WARC as we discover it, whether it is a regular WARC file or one extracted from a WACZ file. I would deal with one file at a time and then loop over to the next one.
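
A rough sketch of that per-file flow (hypothetical names such as input_paths and index_one_warc; extract_warcs_from_wacz is assumed here to yield one extracted file path at a time, which differs from the signature in this PR):

import os
import shutil

def index_one_warc(warc_path):
    # Placeholder for the usual per-WARC indexing logic.
    print(f'Indexing {warc_path}')

for path in input_paths:  # a mix of .warc.gz and .wacz paths
    if path.endswith('.wacz'):
        for extracted in extract_warcs_from_wacz(path):
            index_one_warc(extracted)
            # The temp directory is recoverable from the file path itself
            # (or from its parent, if WARCs land under an archive/ subdirectory).
            shutil.rmtree(os.path.dirname(extracted), ignore_errors=True)
    else:
        index_one_warc(path)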

Member Author

@ibnesayeed This seems to require a revamp outside the scope of this GH issue/PR. I agree that dealing with one WARC at a time would likely be more computationally efficient.

Member

I understand that it would require a change in the workflow. When done, it would be more space-efficient, as not all the WARCs would need to be extracted from WACZ files upfront, duplicating them on disk, before processing them.

It is okay to leave things as they are right now and get back to this when we have a WARC record iterator for WACZ files, at which point most of these changes will be rendered obsolete.

ikreymer commented Jun 6, 2022

While it's nice to see more support for WACZ, I am wondering how you see this implementation supporting the WACZ format.

It seems like this PR is simply treating WACZ as just a ZIP containing some WARC files.
If that is the case, why not just add documentation for users to import a WACZ file by using:

unzip somefile.wacz
ipwb index archive/*.warc.gz

But a WACZ file is more than just a ZIP of WARC files! More importantly, it has a built-in CDXJ in a specific format, which allows for fast access and does not require reindexing of the WARCs. It also has metadata, fixity information (payload and full record digests), and an optional signature. There is a specific structure, extending the Frictionless Data Package, that is validated by py-wacz. (Of course, this is all being developed/improved.)
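
For reference, a short sketch (standard library only, assuming a local example.wacz) that lists a WACZ's members to show this structure:

import zipfile

# A WACZ typically contains archive/*.warc.gz, indexes/ (CDXJ, possibly
# gzipped), pages/pages.jsonl, datapackage.json, and digest/signature files.
with zipfile.ZipFile('example.wacz') as wacz:
    for name in wacz.namelist():
        print(name)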

But if everything but the WARCs is being discarded, I wonder if that should be considered 'supporting WACZ' or just 'importing WARCs from a WACZ file'.

Can you see a way for ipwb to support more of the WACZ format, such as reusing the existing CDXJ indexes perhaps?

In some ways, the current approach is a lossy transformation: WACZ files can already be hosted and used directly from IPFS. By importing into ipwb, you're taking the WARCs from the WACZ, reindexing them, and putting just the WARCs onto IPFS in a slightly different way, while creating a new index that is not stored on IPFS, and discarding other data (metadata, signatures, etc...). I wonder if ipwb could do something else to better leverage the properties of the WACZ format?

machawk1 (Member, Author) commented Jun 6, 2022

@ikreymer This initial implementation is baseline support for WACZ. We are hoping to do more with the format in the future, beyond treating it as a container for WARCs.

Reusing the CDXJ in the WACZ for ipwb's own generated CDXJ index could be interesting. We have not thought about it much yet.

@ibnesayeed likely has some ideas / 2¢. Thanks for your input!

ibnesayeed (Member) commented Jun 7, 2022

It seems like this PR is simply treating WACZ as just a ZIP containing some WARC files.
If that is the case, why not just add documentation for users to import a WACZ file by using:

@ikreymer that would have been my preference as well (as noted in an earlier comment, #770 (review)), but @machawk1 wanted a more built-in approach, hence this PR.

But if everything but the WARCs is being discarded, I wonder if that should be considered 'supporting WACZ' or just 'importing WARCs from a WACZ file'.

In that sense, IPWB does not even support WARC files, let alone WACZ files. During the indexing and ingestion process, it iterates over each WARC record and splits the headers and payload to store them as small IPFS objects to be used at replay time. This is why we are interested in an iterator over the WARC records and not in the container (or the container of the container) format.
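
A rough sketch of that idea (not ipwb's actual implementation), using warcio and ipfshttpclient:

import ipfshttpclient
from warcio.archiveiterator import ArchiveIterator

client = ipfshttpclient.connect()  # assumes a local IPFS daemon

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        # Store headers and payload as separate IPFS objects and keep the
        # returned hashes for the CDXJ index used at replay time.
        header_hash = client.add_bytes(record.http_headers.to_str().encode('utf-8'))
        payload_hash = client.add_bytes(record.content_stream().read())
        print(record.rec_headers.get_header('WARC-Target-URI'),
              header_hash, payload_hash)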

Can you see a way for ipwb to support more of the WACZ format, such as reusing the existing CDXJ indexes perhaps?

We can leverage the CDXJ file packaged inside the WACZ file, but not for replay, only as a means to iterate over WARC records during indexing/ingestion. However, for large WACZ files this approach might be slow, because the reader will be seeking to different locations if the CDXJ file is not sorted in the order the records are preserved. Moreover, if the built-in CDXJ excludes certain WARC record types (which might not be the case, but is worth mentioning here), then those will not be reachable.
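
A sketch of that CDXJ-driven iteration (assuming the packaged index is at indexes/index.cdxj and carries the usual offset, length, and filename fields; real WACZ files may gzip the index and record filenames differently):

import json
import zipfile

with zipfile.ZipFile('example.wacz') as wacz:
    with wacz.open('indexes/index.cdxj') as cdxj:
        for line in cdxj:
            surt, timestamp, block = line.decode('utf-8').split(' ', 2)
            entry = json.loads(block)
            # Read the raw (still gzipped) WARC record by offset/length.
            with wacz.open('archive/' + entry['filename']) as warc:
                warc.seek(int(entry['offset']))
                record_bytes = warc.read(int(entry['length']))
            # ... hand record_bytes to the indexer ...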

In some ways, the current approach is a lossy transformation: WACZ files can already be hosted and used directly from IPFS. By importing into ipwb, you're taking the WARCs from the WACZ, reindexing them, and putting just the WARCs onto IPFS in a slightly different way, while creating a new index that is not stored on IPFS, and discarding other data (metadata, signatures, etc...). I wonder if ipwb could do something else to better leverage the properties of the WACZ format?

I am not sure I would call it lossy, because the goal of IPWB is different from that of some other web archiving systems. It operates at the atomic level of records while gluing related pieces together for replay. IPWB was not modeled to interact with WARC/WACZ or any other container/bundle format at the time of replay. For long-term preservation, one is encouraged to retain WARC/WACZ files with as much provenance and metadata as possible. IPWB can be improved to incorporate more metadata, followed by re-ingesting WARC files from cold storage.

@machawk1 machawk1 requested a review from ibnesayeed June 27, 2022 21:47
machawk1 (Member, Author)

@ibnesayeed I would like your re-review here, as I added some logic to retain the temp paths and remove them as required so as not to leave side effects. Inferring these paths might be unreliable, as the WARCs extracted from a WACZ are stored at {unique_temp_path}/archive/warcname.warc.gz rather than simply {unique_temp_path}/warcname.warc.gz.

If you would rather we change this to the latter scenario, which should not cause a clash issue due to the unique paths, let me know and I can clean up the logic.

Otherwise, the changes here meet the requirements of #710. The latter scenario can always come in with a later refactoring.

machawk1 (Member, Author)

Another caveat is whether subdirectories beyond /archive are legal for WARC storage in a WACZ. If they are, any removal of that structure when writing the WARCs to disk could introduce a clash.

I would have to check the WACZ spec to see whether additional organization like this is legal. If not, the cleaner way would be to remove the /archive prefix when the WARC is extracted to disk.
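
A sketch of such a flattening extraction (a hypothetical helper, not this PR's code; only safe if the spec guarantees a flat archive/ directory):

import os
import zipfile

def extract_warcs_flat(wacz_path, dest_dir):
    # Write each archive/*.warc(.gz) member directly into dest_dir,
    # dropping the archive/ prefix; clashes are possible if the WACZ
    # nests WARCs in further subdirectories.
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            if name.startswith('archive/') and name.endswith(('.warc', '.warc.gz')):
                target = os.path.join(dest_dir, os.path.basename(name))
                with wacz.open(name) as src, open(target, 'wb') as dst:
                    dst.write(src.read())
                extracted.append(target)
    return extracted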

ibnesayeed (Member) left a comment

I feel like we have some unnecessary code because we are doing things the complicated way. For example, we maintain a list of WARC files extracted from WACZs to be removed later, and we also maintain a list of temporary directories to be removed, though removing the latter will automatically remove their contents.

I am not sure merging this PR would be a good idea, as it brings a lot of complexity that will be difficult to maintain and reason about, and at some point it might become useless (if and when we have a WARC record iterator built into the WACZ library).

wacz_paths = []
for warc_path in warc_paths:
    if is_wacz(warc_path):
        (w_paths, dirs_to_cleanup) = extract_warcs_from_wacz(warc_path)
Member

The dirs_to_cleanup here is overwritten on each iteration of the loop, so at the end it will only hold the reference to the temporary dirs of the last WACZ file for cleanup (unless I am missing something).

Member Author

Good catch on dirs_to_cleanup not being retained.

Member Author

I think the logic around extracted WARCs and temp directories can be simplified; there is a bit too much disk maintenance. The general gist is that paths to WACZ files could be intermingled with WARC paths here, and thus the WARCs extracted from the WACZ files need to be removed, but not the WARCs that were passed in.
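
One way to keep that distinction explicit, sketched with hypothetical names (is_wacz and extract_warcs_from_wacz exist in the PR, but the signature used here is assumed):

import shutil

# Pair each WARC path with the temp directory it came from; None marks
# WARCs that were passed in directly and must not be deleted.
warcs_to_index = []

for path in input_paths:  # hypothetical mixed list of .warc.gz and .wacz paths
    if is_wacz(path):
        warc_paths, temp_dir = extract_warcs_from_wacz(path)
        warcs_to_index.extend((p, temp_dir) for p in warc_paths)
    else:
        warcs_to_index.append((path, None))

# ... index every path in warcs_to_index ...

# Afterwards, clean up only the directories we created.
for temp_dir in {d for _, d in warcs_to_index if d is not None}:
    shutil.rmtree(temp_dir, ignore_errors=True)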

machawk1 (Member, Author) commented Feb 28, 2023

  • Change the process to handle one WARC at a time: extraction, ingestion (the ipwb magic), then deletion, before moving on to the next WARC in the WACZ.

@machawk1 machawk1 marked this pull request as draft February 28, 2023 20:28
@machawk1 machawk1 linked an issue Apr 10, 2023 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Support indexing WACZ files
3 participants