Alternative idea: store WACZ files in ZIM (wacz2zim) #81

Closed
ikreymer opened this issue Jan 29, 2021 · 11 comments

@ikreymer

ikreymer commented Jan 29, 2021

I wanted to suggest an alternative approach to warc2zim, a wacz2zim, which may address some of the issues raised in #80.

Webrecorder is developing the WACZ format (https://github.com/webrecorder/wacz-format) and we have a stable release for it. The WACZ format bundles WARC files, indexes and page data inside a ZIP file. The format is specifically optimized for reading in the browser, and the wabac.js system is designed to read the WACZ file on-demand. Browsertrix-Crawler will soon be updated to generate WACZ as well.

An idea would be to simply put the WACZ file directly into the ZIM, along with the other files used by the wabac.js service worker and a placeholder HTML page. For example, an archive of a YouTube page might look something like:

A/youtube.com/watch%2F... - contains only the text; loaded initially just for the redirect
A/replay/archive.wacz - all actual data will be read from this file
A/replay/sw.js
A/replay/topFrame.html
... (other supporting files that are part of the current replay, in templates/)

To support the existing redirect when no service worker is installed, and search, ZIM entries can still be added for HTML pages only, like youtube.com/watch?.... Since the WACZ will already include a page list with extracted text, these entries can be generated from the WACZ pages file. They could be just bookmarks containing the text plus a redirect, not the actual content. (Or the actual content could also be used, if we want to rely on Xapian specifically for text extraction.)

The advantages of this system would be that no custom loading or fuzzy matching would be needed.
The loading path and the conversion would be simplified, since the latest wabac.js and the WACZ format can simply be packaged into the ZIM. This also preserves the prefix queries that wabac.js normally relies on for reading WACZ files.
The client will make requests to archive.wacz with specific ranges, and it would just be a binary file as far as the readers are concerned.
This should significantly lower the maintenance burden of wabac.js-based replay in ZIM, and should have minimal impact on users.

The main disadvantage is perhaps compression: WACZ uses gzip for now, and contains WARC files inside it, which also use gzip. There is a proposal to support zstd for WARC, and that may eventually be supported (though probably not too soon).
This would require zstd decompression in JS as well.

I don't know how much of an issue that is, or what the compression gains would be, especially for sites that are heavy on video/images. My guess is that the gains are smaller than on a text-heavy site like Wikipedia, where zstd probably offers significant improvements.

With this approach, the system would be able to use prefix search over the WACZ and also cache data in IndexedDB (on demand). It would also be beneficial to clear out the IndexedDB when a WACZ-based ZIM is uninstalled; this is similar to uninstalling the SW, but somewhat simpler, and the cache size can be adjusted.

@kelson42 added the enhancement and question labels Jan 29, 2021
@rgaudin

rgaudin commented Jan 29, 2021

Thank you for this. I've skimmed through the WACZ spec and I kinda like it. My understanding is that you read+parse the index at the end of the ZIP file to guess/map the byte ranges of the actual data inside the WACZ.

My concerns are:

  • throwing away most of what we did.
  • having to go through all the in-ZIM design discussions again
  • what about search? The new libzim introduced a segregation between record content and indexed data, but this is tied to records. With this solution, we'd have only one giant record, so that would mean no libzim search… and no suggestions nor random page either, for the same reason. I understand we could work around search/suggest with fake (HTML redirect) records.

Also important is that this moves us further away from the ZIM format.

That said, from a user perspective, the only drawback is (slightly?) larger ZIM files and (somewhat?) more CPU required, while the maintenance burden seems to be greatly reduced.

Question: would this require specific code for in-zim at the wabac/replayer level?

@ikreymer

ikreymer commented Jan 30, 2021

> Thank you for this. I've skimmed through the WACZ spec and I kinda like it. My understanding is that you read+parse the index at the end of the ZIP file to guess/map the byte ranges of the actual data inside the WACZ.

Yes, this works due to a property of ZIP files (the central directory is stored at the end), and I based it on this library:
https://github.com/Rob--W/zipinfo.js, which provides the JS implementation and an explanation.

Since WARC files do not have a built-in index, one is stored in the indexes/ directory (usually a compressed main index plus a binary-searchable secondary index).

Everything is accessible with range requests into the WACZ file.
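For illustration, here is a minimal sketch of that access pattern (in Python rather than JS, using only the standard library; the URL is a placeholder and ZIP64 archives are ignored): find the end-of-central-directory record near the end of the file, then range-request the central directory, which maps every member to its byte range.

```python
import struct
import urllib.request

WACZ_URL = "https://example.org/replay/archive.wacz"  # placeholder

def fetch_range(url, start, length):
    # Issue an HTTP Range request for bytes [start, start+length).
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def total_size(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

size = total_size(WACZ_URL)
# The EOCD record sits within the last 22 + 65535 bytes (max comment size).
tail = fetch_range(WACZ_URL, max(0, size - 65557), min(size, 65557))
eocd = tail.rfind(b"PK\x05\x06")
# EOCD bytes 12-15: central directory size; bytes 16-19: its file offset.
cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12:eocd + 20])
central_directory = fetch_range(WACZ_URL, cd_offset, cd_size)
# Each central-directory entry (signature PK\x01\x02) records a member's
# compressed size and local-header offset: enough to range-request any
# file inside the WACZ on demand.
```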

> My concerns are:
>
> • throwing away most of what we did.

Well, it would be removing some things, but not everything.
Anything related to parsing WARCs would probably be removed,
but everything related to the service worker handling, and most of the logic/UI in templates/, would remain.

We would still keep the current UI (spinner, error messages, 404 page).

> • having to go through all the in-ZIM design discussions again

Right, we would need to decide on how things are laid out in the ZIM, but hopefully it is simpler.

> • what about search? The new libzim introduced a segregation between record content and indexed data, but this is tied to records. With this solution, we'd have only one giant record, so that would mean no libzim search… and no suggestions nor random page either, for the same reason. I understand we could work around search/suggest with fake (HTML redirect) records.

Yes, that was my suggestion: creating 'placeholder' HTML entries in the ZIM for each page that is searchable; if loaded, they will just redirect to the service worker, which is what happens now.

The WACZ will provide a list of pages in the pages/pages.jsonl file, which will contain entries like this:

{"url": "https://example.com/page.html", "title": "Example Page", "text": "all extracted text here", "ts": ...}
{"url": "https://example.com/another.html", "title": "Other Page", "text": "more text here", "ts": ...}

My idea would be to just convert these to ZIM entries, e.g. /A/example.com/page.html, which would contain:

<html>
<head>
<script><!--check for SW and redirect, current sw_check.html --></script>
<title>Example Page</title>
</head>
<body>
all extracted text here
</body>
</html>

Then, I think it will probably work with the current Xapian text search, suggestions, random page, etc...

These would be the only other entries added besides the replay UI files from templates/ and the .wacz file.
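As a rough illustration of that conversion (not actual warc2zim code; write_zim_entry is a hypothetical stand-in for the zimwriter call, and the script placeholder mirrors the sw_check.html snippet above):

```python
import html
import json
import zipfile

STUB = """<html>
<head>
<script><!-- check for SW and redirect, as in sw_check.html --></script>
<title>{title}</title>
</head>
<body>
{text}
</body>
</html>"""

def write_zim_entry(path, content):
    # Hypothetical stand-in for adding an HTML entry to the ZIM.
    print(path, len(content))

with zipfile.ZipFile("archive.wacz") as wacz:
    with wacz.open("pages/pages.jsonl") as pages:
        for line in pages:
            page = json.loads(line)
            if "url" not in page:
                continue  # skip the jsonl format header line
            # Drop the scheme to get the A/{canonicalized-url} entry path.
            path = "A/" + page["url"].split("://", 1)[1]
            write_zim_entry(path, STUB.format(
                title=html.escape(page.get("title", "")),
                text=html.escape(page.get("text", "")),
            ))
```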

> Also important is that this moves us further away from the ZIM format.

> That said, from a user perspective, the only drawback is (slightly?) larger ZIM files and (somewhat?) more CPU required, while the maintenance burden seems to be greatly reduced.

> Question: would this require specific code for in-zim at the wabac/replayer level?

It should not, since it will just load the WACZ over HTTPS like it does in any other case. Assuming standard Range requests work, that should be it. There would be less special code than now, which would improve maintenance.

The custom part is then just the UI, since we are not using the ReplayWeb.page UI but already have a custom one with spinner, localization, SW checks, etc...

@rgaudin

rgaudin commented Jan 30, 2021

OK, would it be possible to not store the extracted text in pages.jsonl? Unless I misunderstood something, this is a full duplication of the complete text of the resource in the ZIM. When we feed that to Xapian for its index, it's going to be in the ZIM file something like three times.

It would be better to loop over WARC records when we create those fake ZIM entries, extract the text, feed it to Xapian (using the new getIndexData feature of libzim), and set the title as well.

So I guess the question is more: can the search/suggest features of your replayer be completely disabled, and thus the title and text removed from pages.jsonl?

I forgot another concern: the fact that WACZ is your format while WARC is much more standard and widely adopted; I believe that played some role in the previous decision.

@ikreymer

ikreymer commented Feb 1, 2021

> OK, would it be possible to not store the extracted text in pages.jsonl? Unless I misunderstood something, this is a full duplication of the complete text of the resource in the ZIM. When we feed that to Xapian for its index, it's going to be in the ZIM file something like three times.

Yeah, I guess we could delete the text from pages.jsonl, or not generate it there. (This feature is still being worked on as part of webrecorder/browsertrix-crawler#2.)

> It would be better to loop over WARC records when we create those fake ZIM entries, extract the text, feed it to Xapian (using the new getIndexData feature of libzim), and set the title as well.

Yes, it would be possible to do that, but it may be extra work without necessarily any benefit.

I'm not sure what getIndexData does exactly, or whether it would be helpful. The text in pages.jsonl would be extracted from the live web page as rendered in the browser. This is not necessarily the same as just extracting text from the HTML; ideally it will give better results for dynamic pages, as it gets the text directly from what is rendered in the browser.

The pages.jsonl is also important because it lists what the top-level pages are, which is not necessarily every text/html resource. In fact, indexing everything that is text/html leads to lots of false positives (some ad iframe, some other resource that defaults to text/html on old servers, etc.).

It would be possible to just look at the pages.jsonl, then look up those URLs in the WACZ and use that as the input for the text search. But then you would just re-extract the text that is already being extracted; it is doable, though. (We still need to add URL lookup in the WACZ, or just iterate over the WARCs, which takes longer.)

> So I guess the question is more: can the search/suggest features of your replayer be completely disabled, and thus the title and text removed from pages.jsonl?

The data is only used in the replayweb.page UI, so it won't be used in this custom UI. But as explained above, I think using it simplifies things, as it provides the browser-extracted text and title that can be plugged into whatever is needed in the ZIM, so there is less duplicate work.

> I forgot another concern: the fact that WACZ is your format while WARC is much more standard and widely adopted; I believe that played some role in the previous decision.

Partly for this reason, the WACZ file preserves the full WARC files in the ZIP. If the goal is to convert back to WARC from a ZIM, it would be simpler to get a WARC out of a WACZ-based ZIM (just extract it) than with what we have now, since the current system transforms the WARC records and drops some of them.

@rgaudin

rgaudin commented Feb 2, 2021

> Yeah, I guess we could delete the text from pages.jsonl, or not generate it there. (This feature is still being worked on as part of webrecorder/browsertrix-crawler#2.)

OK

> > It would be better to loop over WARC records when we create those fake ZIM entries, extract the text, feed it to Xapian (using the new getIndexData feature of libzim), and set the title as well.
>
> Yes, it would be possible to do that, but it may be extra work without necessarily any benefit.

Well, not storing the same content multiple (3!) times is certainly beneficial.

> I'm not sure what getIndexData does exactly, or whether it would be helpful. The text in pages.jsonl would be extracted from the live web page as rendered in the browser. This is not necessarily the same as just extracting text from the HTML; ideally it will give better results for dynamic pages, as it gets the text directly from what is rendered in the browser.

getIndexData is just something you provide to libzim so that it can index your Entry using a different input than the Entry's content. So ideally we'd return the extracted text here for those fake redirections and benefit from full-text indexing/search.
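To make that concrete, here is a minimal sketch of such an entry, assuming python-libzim's writer interface (an Item plus an IndexData hook; the exact class and method names should be treated as an assumption here, since this API was brand new at the time):

```python
from libzim.writer import Hint, IndexData, Item, StringProvider

class PageIndexData(IndexData):
    """Feed browser-extracted text to Xapian instead of the stub HTML."""

    def __init__(self, title, text):
        self.title, self.text = title, text

    def has_indexdata(self):
        return True

    def get_title(self):
        return self.title

    def get_content(self):
        return self.text  # text from pages.jsonl, not the entry payload

    def get_keywords(self):
        return ""

    def get_wordcount(self):
        return len(self.text.split())

    def get_geoposition(self):
        return None  # no geo data for web pages

class StubPageItem(Item):
    """Fake (HTML redirect) entry whose index data is the extracted text."""

    def __init__(self, path, title, stub_html, text):
        super().__init__()
        self.path, self.title = path, title
        self.stub_html, self.text = stub_html, text

    def get_path(self):
        return self.path

    def get_title(self):
        return self.title

    def get_mimetype(self):
        return "text/html"

    def get_contentprovider(self):
        return StringProvider(self.stub_html)

    def get_hints(self):
        return {Hint.FRONT_ARTICLE: True}

    def get_indexdata(self):
        return PageIndexData(self.title, self.text)
```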
This browser-extracted text is very interesting (compared to source parsing). We should definitely index that.
Out of curiosity, how is it done? Is it just DOM manipulation to get text values from text nodes, or is it something that the browser provides?

> The pages.jsonl is also important because it lists what the top-level pages are, which is not necessarily every text/html resource. In fact, indexing everything that is text/html leads to lots of false positives (some ad iframe, some other resource that defaults to text/html on old servers, etc.).

We don't want to get rid of the file, but none of what you just mentioned requires anything but the URL.

> It would be possible to just look at the pages.jsonl, then look up those URLs in the WACZ and use that as the input for the text search. But then you would just re-extract the text that is already being extracted; it is doable, though. (We still need to add URL lookup in the WACZ, or just iterate over the WARCs, which takes longer.)

Yeah, I think the issue here is having lower-quality text (source vs. rendered document).

> > So I guess the question is more: can the search/suggest features of your replayer be completely disabled, and thus the title and text removed from pages.jsonl?
>
> The data is only used in the replayweb.page UI, so it won't be used in this custom UI. But as explained above, I think using it simplifies things, as it provides the browser-extracted text and title that can be plugged into whatever is needed in the ZIM, so there is less duplicate work.

I think you misunderstood the issue. Those title and text fields are great, but since we won't use your custom suggestion/search feature, they will be put in the ZIM index. With them in the index, that's a potentially quite large duplication.
An option would be to remove them from the WACZ (keeping only URLs) once this is in the ZIM, but I'm not sure how practical that is.

> > I forgot another concern: the fact that WACZ is your format while WARC is much more standard and widely adopted; I believe that played some role in the previous decision.
>
> Partly for this reason, the WACZ file preserves the full WARC files in the ZIP. If the goal is to convert back to WARC from a ZIM, it would be simpler to get a WARC out of a WACZ-based ZIM (just extract it) than with what we have now, since the current system transforms the WARC records and drops some of them.

@kelson42 can respond to that, but I think the idea is more about reusing and leveraging existing formats and tools rather than competing with them, especially if that brings maintenance onto our shoulders. WACZ is compatible with this IMO, as it is just an additional layer on top of WARC.

@ikreymer

ikreymer commented Feb 3, 2021

> getIndexData is just something you provide to libzim so that it can index your Entry using a different input than the Entry's content. So ideally we'd return the extracted text here for those fake redirections and benefit from full-text indexing/search.

Ah, I see, that could work then. I wasn't sure whether the default Xapian text/html extraction would just take care of it for a fake entry, but yes, we can just pass the text/title to zimwriter a different way if needed.

> This browser-extracted text is very interesting (compared to source parsing). We should definitely index that.
> Out of curiosity, how is it done? Is it just DOM manipulation to get text values from text nodes, or is it something that the browser provides?

Yes, the browser has an API to return the DOM as JSON. From that, all of the text nodes can be filtered out recursively:
https://github.com/webrecorder/browsertrix/blob/pywb-instance/simple-driver/index.js#L234
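The shape of that snapshot varies by API, but the recursive filtering itself is simple; here is a sketch over a DOM-as-JSON tree (the node field names are illustrative, not the exact browser output):

```python
def extract_text(node, out=None):
    # Collect the values of all text nodes (nodeType 3), depth-first.
    out = [] if out is None else out
    if node.get("nodeType") == 3:
        value = (node.get("nodeValue") or "").strip()
        if value:
            out.append(value)
    for child in node.get("children", []):
        extract_text(child, out)
    return out

# text = " ".join(extract_text(dom_snapshot_root))
```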

> I think you misunderstood the issue. Those title and text fields are great, but since we won't use your custom suggestion/search feature, they will be put in the ZIM index. With them in the index, that's a potentially quite large duplication.
> An option would be to remove them from the WACZ (keeping only URLs) once this is in the ZIM, but I'm not sure how practical that is.

Ok, yes, that's definitely doable: we could read the pages.jsonl, remove those fields, and re-add the file, or simply preprocess it before creating the ZIP. The WACZ is just a ZIP file after all, so I think it should be OK to update it.
WACZ support for browsertrix-crawler is currently being worked on, so there's definitely an opportunity to make sure it supports this use case as it is being added.
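A sketch of that preprocessing step (filenames are placeholders; note the standard library can't replace a ZIP member in place, so the archive is rewritten, and a real implementation would also have to refresh the size/hash recorded for pages.jsonl in datapackage.json):

```python
import json
import zipfile

def trim(line):
    # Keep only what the replayer needs (url, ts, id, ...); drop title/text.
    page = json.loads(line)
    page.pop("title", None)
    page.pop("text", None)
    return json.dumps(page)

with zipfile.ZipFile("archive.wacz") as src, \
     zipfile.ZipFile("archive-trimmed.wacz", "w") as dst:
    for info in src.infolist():
        data = src.read(info.filename)
        if info.filename == "pages/pages.jsonl":
            data = "\n".join(
                trim(l) for l in data.decode().splitlines() if l.strip()
            ).encode()
        # writestr preserves each member's original compression type and
        # recomputes its size/CRC.
        dst.writestr(info, data)
```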

> @kelson42 can respond to that, but I think the idea is more about reusing and leveraging existing formats and tools rather than competing with them, especially if that brings maintenance onto our shoulders. WACZ is compatible with this IMO, as it is just an additional layer on top of WARC.

Agreed!

@stale

stale bot commented Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@rgaudin

rgaudin commented May 23, 2023

Coming back to this after a long while (more than 2 years!).

Because there are, incidentally, discussions about whether it is feasible to replay without a Service Worker by moving some of the logic from pywb into libkiwix, the rest of my comment only applies to the current situation: a SW-based replayer running purely client-side.

The question is thus whether to keep warc2zim as it is or to transform it into a wacz2zim: use a single WACZ as input instead of a list of WARC files.

Details

Current

In the current situation, we iterate over WARC records and for each of them we:

  • Insert a piece of JS (sw_check.html) into WARC Payload of text/html entries [1]
  • Store that WARC Payload into A/{canonicalizedUrl}
  • Store the WARC Headers into H/{canonicalizedUrl}

[1] This checks where it is running from. Inside a frame and the SW is installed? Great, do nothing. Otherwise, redirect to the Home page (where the SW will be installed), asking the home page to redirect back.

The Home page installs a SW into xxx/A so that any A/-prefixed URL goes through it. Once installed, it redirects to that URL so it gets handled by the SW (window.location.href = prefix).

In the SW, URLs are treated differently based on an arbitrary modifier: mp_. Requests without mp_ are rewritten to be our UI (topFrame.html) with two key variables set, prefix and startUrl, and the iframe it contains is pointed at the mp_ version (iframe.src = prefix + "mp_/" + startUrl;).

The SW works roughly in two steps:

  1. The SW makes a (regular) request to that */canonicalizedUrl, using the crawled request headers [2], and returns the response (body and headers) from the A and H entries.
  2. It rewrites the returned response (for mp_ ones) to:
  • insert all the context-specific wombat info (what pywb adds in the templates)
  • insert a wombat include

[2] If there are no headers for that URL, the URL goes into the FuzzyMatcher, which returns the correct URL of the matched record.

ℹ Sorry if this was a bit off-topic, but it serves as documentation 🤷‍♂️

With the current situation, the content of a ZIM would look like:

A/404.html
A/accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&json=standard&__wb_method=POST&
A/content-autofill.googleapis.com/v1/pages/ChRDaHJvbWUvMTEyLjAuNTYxNS40ORIeCXjz9Q1zsMP8EgUNnpR2qRIFDdOVJAISBQ0MdULB?alt=proto
A/index.html
A/isago.rskg.org/
A/isago.rskg.org/a-propos
A/isago.rskg.org/conseils
A/isago.rskg.org/faq
A/isago.rskg.org/static/favicon256.png
A/isago.rskg.org/static/tarifs-isago.pdf
A/load.js
A/maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css
A/sw.js
A/topFrame.html
H/accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&json=standard&__wb_method=POST&
H/content-autofill.googleapis.com/v1/pages/ChRDaHJvbWUvMTEyLjAuNTYxNS40ORIeCXjz9Q1zsMP8EgUNnpR2qRIFDdOVJAISBQ0MdULB?alt=proto
H/isago.rskg.org/
H/isago.rskg.org/a-propos
H/isago.rskg.org/conseils
H/isago.rskg.org/faq
H/isago.rskg.org/static/favicon256.png
H/isago.rskg.org/static/tarifs-isago.pdf
H/maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css

WACZ-based

Our sole input would be a single WACZ file. We'd read pages/pages.jsonl and iterate over the entries. For each entry we would:

  • read the title and text properties and create a {canonicalizedUrl} entry for it, using the title. text would be fed to getIndexData. The entry would be a stub HTML page that redirects to the viewer-based URL for that URL.
  • update pages_jsonl to remove the title and text properties. (Do we need the size?)

Then we would replace pages/pages.jsonl with a dump of our updated (trimmed-down) pages_jsonl.
It's then time to add the WACZ file to the ZIM, along with the replay files, as with the current version; a condensed sketch of the whole pass follows.
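Condensed into one loop, the wacz2zim pass would look roughly like this (add_stub_item, canonicalize, repack_with and add_file are hypothetical stand-ins for the zimwriter/packaging calls, tying together the pieces sketched earlier in the thread):

```python
import json
import zipfile

with zipfile.ZipFile("input.wacz") as wacz:
    trimmed = []
    for line in wacz.open("pages/pages.jsonl").read().decode().splitlines():
        page = json.loads(line)
        if "url" not in page:
            trimmed.append(line)  # keep the jsonl format header
            continue
        title = page.pop("title", "")
        text = page.pop("text", "")
        # Stub HTML entry redirecting to the viewer URL; the extracted
        # text is handed to libzim via getIndexData.
        add_stub_item(canonicalize(page["url"]), title, text)
        trimmed.append(json.dumps(page))

# Re-pack the WACZ with the trimmed pages.jsonl, then store it in the ZIM
# next to the replay files (sw.js, topFrame.html, load.js, ...).
add_file("archive.wacz", repack_with("input.wacz", trimmed))
```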

Main differences would be:

  • We don't read WARC records anymore
  • We don't add WARC Headers Record anymore
  • We don't add WARC Payload Records anymore
  • We don't have to edit Payload records (we were injecting sw_check)
  • We don't have to add FuzzyMatch redirects (fuzzy matching is only done in the replayer)

The behavior of the SW and UI would be identical. We'd just change the replayer config to type=wacz instead of the default remotewarcproxy, and the special kiwix tweak can most likely be removed from wabac.js (except maybe the uppercase Range thing).

Also, we have a custom-css feature that we never use (we inject a <style /> next to sw_check). wabac.js has a similar feature that is applied at runtime, so we'd use that instead.

With the WACZ-based ZIM, content would look like this (for the identical example):

archive.wacz
404.html
index.html
isago.rskg.org/
isago.rskg.org/a-propos
isago.rskg.org/conseils
isago.rskg.org/faq
isago.rskg.org/static/favicon256.png
isago.rskg.org/static/tarifs-isago.pdf
load.js
sw.js
topFrame.html

Side question: how does that pages.jsonl file scale? Do you fetch-and-read it progressively, or do you load it completely in your replayer? What would happen with tens of millions of entries?

Pros

  • Reduced warc2zim codebase: we would not read WARC entries (except the seed one, for language and favicon fallback).
  • Better Suggestion/Random/Article-count features: the pages list being a list of pages and not just text/html records, we can safely (or with better confidence) set FRONT_ARTICLE on all our stub entries, and we shouldn't get broken HTML responses.
  • Better search results: the pages list contains browser-extracted content, which is better than our in-libzim HTML source text extractor.
  • Less processing: we'll add fewer HTML entries, and those will have their indexData ready. Less work for libzim/Xapian.
  • Faster: we won't iterate over all records but just read a single JSON file. Tons of saved I/O.
  • Less custom code and better integration for webrecorder: reduced risk of our custom stuff breaking inadvertently.
  • Some (weird) users could download the WACZ and use it independently.

Cons

  • We don't do it now, but Better auto-detection of multilanguage content zimit#187 mentions looking for languages in all entries (although I am against doing that). That would require reading all entries.
  • Needs to be implemented.
  • The number of media files (Counter metadata) would be completely off.
  • Arguably less accessible content, since ZIM entries would not contain the HTML payload. (The current situation does contain it, but it's hardly usable because of the redirect we inject.)
  • Based on a very young format, while WARC is well established.
  • Slower runtime and higher CPU usage for users, as it implies more network requests (I suppose) and decompression done in JS in the browser.
  • Compression is lower than with zstandard.

It thus looks like a good choice, in the context of using an external (webrecorder's) replayer.

@kelson42

Another CON: we move away from a well-established standard, WARC.

@rgaudin

rgaudin commented May 23, 2023

> Another CON: we move away from a well-established standard, WARC.

Updated my comment. The WARC file is bundled into the WACZ, so we'd still carry a standard file, but indeed the conversion would not be from a standard format to a ZIM but from a WACZ to a ZIM.

Both are documented standards, of course, but it's important to highlight that WARC is the well-established and widespread one.

@rgaudin

rgaudin commented May 31, 2023

In light of openzim/zimit#193 and openzim/zimit#194, I am closing this.

We won't follow the path initially proposed in this ticket. Working off a WACZ or a collection of WARC files (or both) thus becomes an implementation detail. If WACZ takes off, it might be required to support it; I am thinking in particular of record+replay scenarios that a fully-automated browser crawl can't handle.

@rgaudin closed this as completed May 31, 2023