Alternative idea: store WACZ files in ZIM (wacz2zim) #81
Thank you for this. I've skimmed through the WACZ spec and I kinda like it. My understanding is that you read+parse the index at the end of the ZIP file to guess/map the byte ranges of the actual data inside the WACZ. My concerns are:
Also important: this moves us further away from the ZIM format. That said, from a user perspective, the only drawbacks are (slightly?) larger ZIM files and (somewhat?) more CPU required, while the maintenance burden seems to be greatly reduced. Question: would this require specific code for in-ZIM use at the wabac/replayer level?
Yes, this works thanks to a property of ZIP files (the index is stored at the end of the archive), and I based it on an existing library. Since WARC files do not have a built-in index, one is stored in the indexes/ directory (usually a compressed main index and a bin-searchable secondary index). Everything is accessible with range requests into the WACZ file.
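The range-request trick works because ZIP keeps its central directory at the end of the archive, so a reader that fetches only the tail of the file learns every member's name, size, and byte offset. A minimal illustration with Python's stdlib `zipfile` (file names here are made up, not the exact WACZ layout):

```python
import io
import zipfile

# Build a small ZIP in memory to stand in for a WACZ file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("pages/pages.jsonl", '{"url": "https://example.com/"}\n')
    zf.writestr("archive/data.warc.gz", b"\x1f\x8b placeholder bytes")

# The central directory lives at the end of the archive, so reading
# just the tail yields each member's offset and size -- exactly what
# range-based WACZ readers rely on to fetch data on demand.
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        print(info.filename, info.header_offset, info.file_size)
```

Once the offsets are known, each member can be fetched with an HTTP `Range` request without downloading the whole archive.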
Well, it would be removing some things, but not everything. Would still keep the current UI (spinner, error messages, 404 page)
Right, would need to decide on how things are laid out in ZIM, but hopefully simpler
Yes, that was my suggestion: create 'placeholder' HTML entries in the ZIM for each page that is searchable, which, if loaded, will just redirect to the service worker (which is what it does now). The WACZ will provide a list of pages in pages.jsonl.
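The placeholder idea can be sketched as follows. The entry paths and the redirect template here are illustrative assumptions, not the actual warc2zim layout:

```python
import json

# Sketch: turn each page listed in a WACZ pages.jsonl into a tiny
# placeholder HTML entry that hands control to the replayer.
# The template and target path are hypothetical, for illustration.
PLACEHOLDER = """<!doctype html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body><script>window.location = "../replay/index.html#{url}";</script>
</body></html>"""

def placeholder_entries(pages_jsonl: str):
    """Yield (zim_path, html) pairs, one per page in pages.jsonl."""
    for line in pages_jsonl.splitlines():
        if not line.strip():
            continue
        page = json.loads(line)
        yield page["url"], PLACEHOLDER.format(
            title=page.get("title", page["url"]), url=page["url"])

pages = '{"url": "https://youtube.com/watch?v=abc", "title": "A Video"}\n'
entries = list(placeholder_entries(pages))
print(entries[0][0])  # https://youtube.com/watch?v=abc
```

Each generated entry is tiny: the real content stays in the WACZ, and the placeholder only exists so the page is addressable (and indexable) inside the ZIM.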
My idea would be to just convert these to ZIM entries:
Then, I think it will probably work with the current Xapian text search, suggestions, random page, etc. These would be the only other entries added besides the replay UI files.
It should not, since it will just load the WACZ over HTTPS like it does in any other case, assuming standard behavior. The custom stuff then is just the UI: we are not using the ReplayWeb.page UI but already have a custom one with spinner, localization, SW checks, etc.
OK, would it be possible to not store the extracted text in the pages.jsonl? Unless I misunderstood something, this is a full duplication of the complete text of the resource in the ZIM. When we feed that to Xapian for its index, it's going to be in the ZIM file like 3 times. It would be better to loop over WARC records when we create those fake ZIM entries, extract the text and feed it to Xapian (using the new getIndexData). So I guess the question is more: can the search/suggest features of your replayer be completely disabled, and thus the duplicated text removed? I forgot another concern: the fact that WACZ is your format while WARC is much more standard and widely adopted; I believe that played some role in the previous decision.
Yeah, I guess we could delete the text from the pages.jsonl, or not generate it there. (This feature is still being worked on as part of webrecorder/browsertrix-crawler#2)
Yes, it would be possible to do that, but it may be extra work without necessarily any benefit. I'm not sure what getIndexData does exactly or if it would be helpful. The text that would be in pages.jsonl is extracted from the live web page as rendered in the browser. This is not necessarily the same as just extracting text from the HTML; ideally it will have better results for dynamic pages, as it gets the text directly from what's rendered in the browser. The pages.jsonl is also important because it lists what the top-level pages are, which is not necessarily every URL in the archive. It would be possible to just look at the pages.jsonl, and then look up those URLs in the WACZ and use that as the input for the text search. But then you would just re-extract text that is already being extracted; it is doable, though. (We still need to add URL lookup in the WACZ, or just iterate over the WARCs, which takes longer.)
The data is only used in the replayweb.page UI, so it won't be used in this custom UI. But as explained above, I think using it simplifies things, as it provides extracted text from the browser and title that can be plugged into whatever is needed in the ZIM, so less duplicate work.
Partly for this reason, the WACZ file preserves the full WARC files in the ZIP. If the goal is to convert back to WARC from a ZIM, it would be simpler to get a WARC out from a ZIM using WACZ (just extract it) than with what we have now, since the current system transforms the WARC records and drops some of them.
OK
Well, not storing the same content multiple (3!) times is certainly beneficial.
getIndexData is just something you provide to the libzim so that it can index your Entry using a different input than the Entry's content. So ideally we'd return the extracted text here for those fake redirections and benefit from full-text indexing/search.
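To make the idea concrete, here is a self-contained sketch of the pattern being described: an entry whose *content* (a tiny redirect page) differs from the *text handed to the indexer*. Class and method names are hypothetical, not the real libzim binding API:

```python
# Illustrative sketch of the getIndexData idea: the writer asks each
# entry for index text that can differ from the entry's content.
# Names here are made up for illustration, not the libzim API.

class RedirectEntry:
    def __init__(self, url: str, title: str, extracted_text: str):
        self.url = url
        self.title = title
        self._text = extracted_text

    def get_content(self) -> str:
        # What a reader sees: a tiny redirect page, not the full text.
        return (f'<html><head><meta http-equiv="refresh" '
                f'content="0;url={self.url}"></head></html>')

    def get_index_data(self) -> str:
        # What Xapian indexes: the text extracted by the crawler,
        # so full-text search works without duplicating content.
        return self._text

entry = RedirectEntry("https://example.com/", "Example",
                      "hello searchable words")
print(len(entry.get_content()), len(entry.get_index_data()))
```

The point is exactly the one made above: the extracted text lives once, in the index, rather than being stored a third time as entry content.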
We don't want to get rid of the file but none of what you just mentioned requires anything but the URL.
Yeah I think the issue here is having a lower quality text (source vs document)
I think you misunderstood the issue. That title and text are great, but since we won't use your custom suggestion/search feature, this will be put in the ZIM index. With it in the index, that's a potentially quite large duplication.
@kelson42 can respond to that, but I think the idea is more about reusing and leveraging existing formats and tools rather than competing with them, especially if that brings maintenance onto our shoulders. WACZ is compatible with this IMO, as it is just an additional layer on top of WARC.
Ah, I see, that could work then. I wasn't sure if the default Xapian text/HTML extraction would just take care of it as well.
Yes, the browser has an API to return the DOM entries as JSON. From that, all of the text nodes can be filtered out recursively:
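The recursive text-node filtering can be sketched as follows. The snapshot shape (nested nodes with `nodeName`, `nodeValue`, `children`) is a simplified assumption about what the browser API returns:

```python
# Sketch: given a DOM snapshot serialized as JSON (nested nodes with
# a nodeName, optional nodeValue, and children), collect all text
# nodes recursively.  The snapshot shape here is simplified/assumed.

def extract_text(node: dict) -> list[str]:
    texts = []
    if node.get("nodeName") == "#text":
        value = (node.get("nodeValue") or "").strip()
        if value:
            texts.append(value)
    for child in node.get("children", []):
        texts.extend(extract_text(child))
    return texts

dom = {
    "nodeName": "BODY",
    "children": [
        {"nodeName": "H1", "children": [
            {"nodeName": "#text", "nodeValue": "Hello"}]},
        {"nodeName": "P", "children": [
            {"nodeName": "#text", "nodeValue": "world"}]},
    ],
}
print(" ".join(extract_text(dom)))  # Hello world
```

Because the walk happens over the rendered DOM, text injected by JavaScript is captured too, which is the advantage over parsing the raw HTML mentioned above.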
Ok, yes, that's definitely doable: we could read the pages.jsonl, remove the text and re-add the file, or simply preprocess before creating the zip. The WACZ is just a zip file after all; I think it should be ok to update it.
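Rewriting pages.jsonl inside the ZIP could look like this minimal sketch (the `pages/pages.jsonl` path follows the WACZ layout; error handling and re-signing are omitted):

```python
import io
import json
import zipfile

# Sketch: rewrite a WACZ (a plain ZIP) so that pages/pages.jsonl no
# longer carries the extracted "text" field, dropping the duplicate
# copy of each page's full text.

def strip_page_text(wacz_bytes: bytes) -> bytes:
    src = zipfile.ZipFile(io.BytesIO(wacz_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "pages/pages.jsonl":
                lines = []
                for line in data.decode().splitlines():
                    if line.strip():
                        rec = json.loads(line)
                        rec.pop("text", None)  # drop extracted text
                        lines.append(json.dumps(rec))
                data = ("\n".join(lines) + "\n").encode()
            dst.writestr(info.filename, data)
    return out_buf.getvalue()
```

Note that any checksums recorded in the WACZ's datapackage would need updating after such a rewrite, so preprocessing before zipping may be the simpler route.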
Agreed! |
Coming back to this after a long while (more than 2 years!). Because there are, incidentally, discussions about whether it is feasible to replay without a Service Worker by moving some of the logic from pywb into libkiwix, the rest of my comment only applies to the current situation: a SW-based replayer running purely client-side. The question is thus: whether to keep warc2zim as it is, or to transform it into a wacz2zim that uses a single WACZ as input instead of a list of WARC files.

Details

Current

In the current situation, we iterate over WARC records and for each of them we:
[1] This checks where it is running from: inside a frame with the SW installed? Great, do nothing. Otherwise, redirect to the Home page (where the SW will be installed), asking the home page to redirect back. The home page installs a SW. In the SW, URLs are treated differently based on an arbitrary modifier. The SW works roughly in two steps:
[2] If there are no headers for that URL, the URL goes into the FuzzyMatcher, which returns the correct URL of the matched record.

ℹ Sorry if this was a bit off-topic, but it serves as documentation 🤷‍♂️

With the current situation, the content of a ZIM would look like:
WACZ-based

Our sole input would be a single WACZ file. We'd read the pages.jsonl from it.
Then we would replace the relevant entries. Main differences would be:
The behavior of the SW and UI would be identical; we'd just change the replayer config to use the single WACZ. With the WACZ-based ZIM, content would look like this (for the identical example):
Side Question: How does that

Pros
Cons
It thus looks like a good choice, in the context of using an external (webrecorder's) replayer.
Another CON: we move away from a well-established standard, which is WARC.
Updated my comment. The WARC file is bundled into the WACZ, so we carry a standard file, but indeed the solution would not go from a standard format to a ZIM but from a WACZ to a ZIM. Both are documented standards of course, but it's important to highlight that WARC is the well-established and widespread one.
In light of openzim/zimit#193 and openzim/zimit#194, I am closing this. We won't follow the path initially proposed in this ticket. Working off a WACZ or a collection of WARC files (or both) thus becomes an implementation detail. If WACZ takes off, it might be required to support it. I am thinking in particular about record+replay scenarios that a fully-automated browser crawl can't master.
I wanted to suggest an alternative approach to warc2zim, a wacz2zim, which may address some of the issues raised in #80
Webrecorder is developing the WACZ format (https://github.com/webrecorder/wacz-format) and we have a stable release for it. The WACZ format bundles WARC files, indexes and page data inside a ZIP file. The format is specifically optimized for reading in the browser, and the wabac.js system is designed to read the WACZ file on-demand. Browsertrix-Crawler will soon be updated to generate WACZ as well.
An idea would be to simply put the WACZ file directly into the ZIM, along with the other file used by the wabac.js sw, and placeholder html. For example, an archive of a youtube page might look something like
To support the existing redirect when no service worker is installed, and search, ZIM entries can still be added only for HTML pages, like youtube.com/watch?.... Since WACZ will already include a page list with extracted text, these entries can be generated from the WACZ pages file. These could just be bookmarks containing text + redirect and not the actual content. (Or, the actual content could also be used if we want to rely on Xapian specifically for text extraction.)

The advantage of this system would be that no custom loading, or fuzzy matching, would be needed.
The loading path and the conversion would be a bit simplified, since the latest wabac.js and the WACZ format can simply be packaged into the ZIM. This also relies on support for the prefix queries that wabac.js normally uses for reading WACZ files.
The client will make requests to archive.wacz with specific ranges, and it would just be a binary file as far as the readers are concerned.
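Serving those requests amounts to slicing a binary blob by byte range. A simplified sketch (handling only the single `bytes=start-end` form of the Range header, with no validation):

```python
# Sketch: serving byte ranges from a binary entry, as a reader would
# for archive.wacz.  Range-header parsing is simplified to the single
# "bytes=start-end" form; multi-range and error cases are omitted.

def serve_range(data: bytes, range_header: str) -> tuple[int, bytes]:
    unit, _, spec = range_header.partition("=")
    assert unit == "bytes"
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else len(data) - 1
    body = data[start:end + 1]          # HTTP ranges are inclusive
    return 206, body                    # 206 Partial Content

status, body = serve_range(b"0123456789", "bytes=2-5")
print(status, body)  # 206 b'2345'
```

From the reader's point of view, archive.wacz is opaque bytes; all the WACZ-specific logic (locating the index, the WARC records, etc.) stays in the client.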
This should significantly lower the maintenance burden on wabac.js-based replay in ZIM, and probably should have minimal impact for users.
The main disadvantage is perhaps compression, as wacz is using gzip for now, and contains warc files inside of it, which also use gzip. There is a proposal to support zstd for warc, and that may eventually be supported (though probably not too soon).
This would require zstd decompression in JS as well.
I don't know how much of an issue it is, or what the compression gains would be, especially for sites that are heavy on video/images. My guess is that they are smaller than on a text-heavy site like Wikipedia, where zstd probably offers significant improvements.
With this approach, the system would be able to use prefix search over the wacz and also cache data in the IndexedDB (on-demand). It would also be beneficial to clear out the IndexedDB when a WACZ-based ZIM is uninstalled -- this is similar to uninstalling the sw, but somewhat simpler, and cache size can be adjusted.