Discuss future implementation #48
I disagree with both approaches.
The ZIM format is a specific format with a specific way of working and specific requirements. Content in ZIM format must comply with this, period. Some of those requirements are:
Those requirements were somehow never really written down, and they are more Kiwix requirements than ZIM ones. But anyway, we want to access the content using Kiwix tools. warc2zim should do what its name implies: transform content in WARC format into content in ZIM/Kiwix format. From what I understand of the WARC format, it is a format to store a web navigation (a browser/scraper navigating a website).
It is nice because the scraper doesn't have to be smart (at least on this part); it just puts what it has done in the WARC. The player (we use https://github.com/webrecorder/wabac.js, and especially the kiwix branch) is the smart part putting everything together. It acts as a proxy: it intercepts the requests from the browser, gets the headers and data corresponding to each request from the Kiwix provider (kiwix-serve, or the hook providing content in native applications), and sends back a response to the browser with the recorded response/content. On the opposite side, the ZIM format stores plain content and the Kiwix player acts as a real server:
warc2zim should be the smart part and transform the WARC format into a ZIM format (following the Kiwix specification):
Then it would introduce no change at all on the Kiwix side, and it would work on all our platforms.
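To make this "smart converter" idea concrete, here is a minimal sketch of iterating over WARC response records and turning them into plain ZIM entries. It uses the warcio library; the add_zim_entry callback stands in for whatever ZIM writer API (python-libzim or similar) would actually be used, and the URL-to-path scheme shown is an assumption, not the actual warc2zim behaviour.

```python
from urllib.parse import urlsplit
from warcio.archiveiterator import ArchiveIterator

def warc_to_entries(warc_path, add_zim_entry):
    """Walk a WARC file and hand every full HTTP 200 response to a ZIM writer.

    add_zim_entry(path, mimetype, payload) is a hypothetical callback; the real
    warc2zim uses python-libzim, whose exact API is not shown here.
    """
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if record.http_headers.get_statuscode() != "200":
                continue  # skip redirects, 206 partial responses, errors
            url = record.rec_headers.get_header("WARC-Target-URI")
            mimetype = record.http_headers.get_header("Content-Type", "application/octet-stream")
            payload = record.content_stream().read()
            # Derive an in-ZIM path from the URL (host + path; the query string is
            # dropped here for simplicity, a real converter must keep a mapping).
            parts = urlsplit(url)
            add_zim_entry(parts.netloc + (parts.path or "/"), mimetype.split(";")[0], payload)
```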
What you propose offers no solution to the dynamic requests problem. Having the response in the ZIM is not the problem, as those are in the WARC already. The problem is that we can't rewrite JS code that would lead to a request for https://something.outside.zim/path and thus, when reading the ZIM, the browser will try to access those links without going through the ZIM.
That is the real spec; we should have started from that :) What we need is to store a mapping (from requested URLs to content in the ZIM). Once we have this mapping:
Where do we store this mapping? The simplest is probably a specific metadata entry. Readers would have to read the metadata and parse it to handle it (kiwix-lib/libzim can help here). How do we build this mapping? Scrapers already do it and store it in a WARC file.
We really need clarification.
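To illustrate where such a mapping could live, here is a minimal sketch of a mapping stored as JSON in a dedicated metadata entry and consulted by a reader. The metadata name and the JSON layout are purely hypothetical; nothing like this is specified for ZIM today.

```python
import json

# Hypothetical content of a dedicated ZIM metadata entry (e.g. "UrlMapping"):
# requested URL -> path of the ZIM entry that actually holds the content.
MAPPING_JSON = """
{
  "https://example.org/api/menu?lang=en": "example.org/api/menu.en.json",
  "https://cdn.example.org/fonts/main.woff2": "cdn.example.org/fonts/main.woff2"
}
"""

URL_MAPPING = json.loads(MAPPING_JSON)

def lookup(requested_url):
    """What a reader (kiwix-lib, kiwix-serve, ...) could do for every request
    coming from the embedded site: consult the mapping before giving up."""
    return URL_MAPPING.get(requested_url)

print(lookup("https://example.org/api/menu?lang=en"))  # -> example.org/api/menu.en.json
```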
Good. We're on the same page. Regarding the headers stuff, it's just a hunch based on the fact that WARC stores them, but it might not be a requirement. @ikreymer could clear that up.
The WARC file is just a storage format, like ZIM, and is designed to store HTTP request/response data, as well as other arbitrary data. Like ZIM, it can be used in any number of different ways.
I understand that the ZIM format was created for MediaWiki, but unfortunately, the rest of the web does not work like MediaWiki. I think this approach works well as long as you are dealing only with static sites with no JavaScript, like MediaWikis.
If only it worked that way... :) Yes, you could store the 206 responses in the WARC (it is allowed), but it is not a good idea because they will be hard to replay. The WARC spec allows storing any request/response of course; how best to do so is up to the capture and replay tools. All Webrecorder tools avoid storing 206 responses, for example, or filter them out, so that only the full 200 record is served. This is necessary to make videos work, for example. But this isn't the main issue.
Yes, the wabac.js replay handles range requests as well, though caching is disabled for now. The ideal solution is to store the raw data as much as possible, and rewrite as needed in the player.
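As a rough illustration of that idea (store only the full 200 record, synthesize partial responses at replay time), here is a sketch of building a 206 answer from a stored full body. This is not wabac.js code; the function name and the limitation to simple "bytes=start-end" ranges are assumptions of the sketch.

```python
def serve_range(full_body: bytes, range_header: str):
    """Build a 206 response from a stored full (200) record.

    Only the single-range "bytes=start-end" form is handled in this sketch;
    suffix ranges ("bytes=-N") fall back to the full 200 response.
    """
    units, _, spec = range_header.partition("=")
    start_s, _, end_s = spec.partition("-")
    if units.strip() != "bytes" or not start_s:
        return 200, {}, full_body
    start = int(start_s)
    end = min(int(end_s) if end_s else len(full_body) - 1, len(full_body) - 1)
    chunk = full_body[start:end + 1]
    headers = {
        "Content-Range": f"bytes {start}-{end}/{len(full_body)}",
        "Content-Length": str(len(chunk)),
    }
    return 206, headers, chunk

# Example: a video player asking for the first kilobyte of a stored mp4.
status, headers, body = serve_range(b"x" * 4096, "bytes=0-1023")
assert status == 206 and len(body) == 1024
```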
This is what's being done now. But it's not just about loading exact URLs, and it's not that simple :) There's a lot more involved; for instance, many sites require much more than serving back the exact URL that was captured.
I think a main question is: what kind of sites do you want to capture? If most of your sites are like Bouquineux.com, then I agree, you don't need any of this. But then you have sites like CK12, e.g. https://flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/1.3/primary/lesson/point-notation-and-function-notation-pcalc, which are highly dynamic and include embedded YouTube videos. These are a challenge to capture and replay, and wabac.js is designed to deal with replaying sites like this. My guess is that more and more (non-MediaWiki) sites will start looking like CK12, and not like Bouquineux.
I would like to suggest an alternative solution: simply storing the WARCs, or rather a WACZ file, directly in the ZIM. To interface with the existing Xapian search, perhaps entries for the searchable pages are made, and all they do is redirect to the service worker, as we have now, and are there only to store the text; the replay will be handled by the system. I am also thinking about maintenance, and this would be the least-maintenance approach, as long as all platforms can support service workers. I think this would avoid having different implementations on different platforms, which I think is going to be a maintenance nightmare. An even better approach would be to support loading wabac.js directly in kiwix-lib.
Unfortunately, it's simply not possible to get sites like CK12 with dynamic content and YouTube videos to work without service workers, or without doing a whole lot of custom rewriting work for each site. It is up to you to decide if that is important or not for what you're trying to do.
Thanks for the feedback @ikreymer.
Regarding the requirements, I think they are not different from wabac.js's overall, if you want to support any website.
To summarize from above:
wabac.js already handles this, but updates will always be needed.
On the other hand, for sites like bouquineux.com, which are all static and have no videos and no JavaScript, it would be easy to add a 'classic zim' mode where warc2zim produces 'standard' ZIMs. I even started prototyping this 'simple' mode initially for warc2zim, though @kelson42 said it was not needed, but it could be done. It will work fine for bouquineux.com and sites like it, and it could warn if classic mode will not work for a certain site. I think the two options should be:
There are a lot of things here. I've tried to regroup them into sections.
JS in websites
MediaWiki is not a static site; there is JS running in it. ZIM (like WARC) replaces the server part. Nothing prevents us from having complex JS running on a website embedded in a ZIM. What we need is to have all resources in the ZIM to be able to provide them to the website in place of the server. This means that we need to know them. It is the work of the scraper, but the problem is the same for ZIM and WARC.
Yes, of course. My main idea was to say that WARC and ZIM were designed differently: ZIM stores resources and WARC records exchanges. That doesn't mean you cannot store curated exchanges (and I hope you do, the same way we store curated resources).
And it is potentially a problem.
Can you be more specific? JS can run in a site embedded in a ZIM file. The website will make requests; if we answer with the same content for each request made, everything should work. I understand that the problem is to determine what the real identifier of a content is in a request. If our "servers" have to do that regularly (parse headers, ...), we can move it into kiwix-lib instead of letting every application implement it.
You need to be clearer about that. And even if we must rewrite it: if wabac.js can know that it has to do the rewrite, we can also know it at ZIM creation time and handle it there.
Adapting to browsers / specific websites
There are two points here.
I have never seen a (major) browser break compatibility with old websites.
Why? I understand that wabac.js needs to be updated because websites change. But the content in a ZIM doesn't change.
How do we do this?
Yes, we have made choices and they may not be the best. Splitting data between the 'A' and 'H' namespaces was a decision made based on an incomplete understanding of the need and the technologies involved. By definition it was a bad decision (even if the solution is the right one).
I agree that we should handle such things on the client side. But I'm not sure the SW is the solution (because it doesn't work everywhere, and we know it leads to some issues with other features).
But service workers will not work on iOS (at least). And this is understandable.
What do we want to support?
Well, we could also simply make our applications play WARC files directly. No need for ZIM at all in this case.
Do we really want to support that? If a site requires such a thing, it is probably because they don't want to be replayed/embedded. Do we want to fight against them?
OK, a timestamp is a reason to have something changing. But how many sites use the current time to identify a content? How can they identify one content with a constantly changing key? Either the content is changing all the time, or the timestamp in the URL is not part of the identifier. I don't know how to handle that for now. But how many sites do this?
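For the timestamp case specifically, fuzzy matching usually means canonicalising the requested URL before looking it up, so that a cache-busting parameter does not defeat the match. A minimal sketch, assuming a fixed list of parameter names that are known to carry only a timestamp (the list itself is an illustrative assumption):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to carry only a cache-busting timestamp.
CACHE_BUSTER_PARAMS = {"_", "ts", "timestamp", "cb"}

def canonical_lookup_key(url):
    """Drop timestamp-like parameters so a request made at replay time matches
    the record captured at crawl time, even if the parameter value differs."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in CACHE_BUSTER_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

# Both requests resolve to the same key, hence the same stored content.
assert canonical_lookup_key("https://example.org/data.json?_=1589234567000") == \
       canonical_lookup_key("https://example.org/data.json?_=1589299999999")
```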
Hi all, I'm excited to see the rapid progress on warc2zim. Thank you for holding these discussions in the open, so we can all learn from them. I want to provide some experience from packaging content for offline use in the "high-touch regime" (not just archiving as-is, but using custom scrapers specific to the source website). Here is the CK-12 flexbook article discussed above, rendered in Kolibri:
The main problems with this "high-touch" approach of rewriting the web content are:
This is why I'm so excited by the potential for an automated archiver and scraper tool that works on most sites and doesn't require custom code for each site.
Perhaps we could subdivide option 1 into two:
I'm not sure how feasible approach 1b would be, but it seems a JS-enabled browser + post-processing could work for a LOT of basic content sites. And for cases where a basic JS-enabled browser + resource-link rewriting doesn't work, option 2 will be available with full replay power.
@rgaudin @mgautierfr Do you have an idea of what websites you want to be able to scrape using warc2zim? Do you need to solve the general problem of scraping arbitrary websites, or can you focus on simpler websites with mostly content and only a bit of JavaScript? If you have a "to scrape" list in mind that you can share, I'd like to take a look, as I wonder if there is any overlap with the channels in the Kolibri Content Library, in which case I can look into creating something along those lines.
Hi @ivanistheone, thanks for sharing your approach as well. I got that particular example from looking at recent requests here: https://github.com/openzim/zim-requests/issues The intent of this project has been for Kiwix to avoid having to write custom scrapers/crawlers for these sites. Thanks for sharing your rendering of this site.
For comparison, here is a ZIM replay of the same page, created with the latest warc2zim and using kiwix-serve: It required no custom work on this site (though I did find a generic bug in wabac.js which prevented the notations from working initially). Unfortunately, the YouTube videos do not load with the current setup, due to lack of fuzzy/inexact matching support (the YouTube URLs include some sort of dynamic hash).
Here is the same page using the full replayweb.page system that I've been working on: This should work with the YouTube videos playing, in Chrome and Firefox. In Safari, unfortunately, the videos currently don't play, as Safari doesn't support the encoding YouTube uses for Chrome and FF, so an extra step would be needed: probably capturing in a different format, or doing a conversion (the video is loaded from an mp4). But the page and equations should load in all modern browsers which support service workers, including mobile, and require no server-side infrastructure. (It should work in mobile Safari, though not the videos yet, due to the format issue.) The static file is just loaded from:
I think this serves to demonstrate the spectrum of possibilities that we have available:
@ivanistheone Thank you for participating in this discussion. Here is the list of websites we want to scrape with Zimit (warc2zim): https://github.com/openzim/zim-requests/issues?q=is%3Aopen+is%3Aissue+label%3Azimit. This is only the showcase for the Zimit project; we want to do more afterwards, of course. This project is not about scraping simple/non-JS websites only, and we won't develop a version of zimit/warc2zim only for simple sites. We know perfectly well that there is no way to be sure to scrape all websites properly, but this tool is developed with this vision.
I see several cases, depending on the complexity of the website we need to scrape:
Basic website (even with JS). Simply download the resources (HTML/CSS/JS/images/...) and rewrite URLs to be relative.
Advanced website. "Advanced" is about the JS that will generate the URLs (local and external).
Complex website. URLs are generated by the JS but cannot be detected in advance.
First of all, the rules are specific to the website, so we are out of the scope of a generic scraper (even if the scraper can integrate these rules in practice). Then, we can have this specific rule handling implemented using code (in a SW), but we can also have these rules in the form of data (think about the rewrite rules of nginx or Apache, https://www.nginx.com/blog/creating-nginx-rewrite-rules/); a rough sketch of such data-driven rules follows this comment. If this is not possible, we indeed need a specific SW, but:
Specific website. Well, it is specific; each website will have its own scraper. There will always be a need for some specific scrapers.
Is a case missing?
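As a rough sketch of the "rules as data" idea mentioned above (in the spirit of nginx rewrite rules), here is what a reader-side lookup could look like. The rule format and the example patterns are hypothetical; this is not an existing Kiwix or wabac.js format.

```python
import re

# Hypothetical rewrite rules, stored for example in a ZIM metadata entry.
# Each rule maps a URL pattern (with capture groups) to an in-ZIM path.
REWRITE_RULES = [
    (r"^https?://cdn\.example\.org/(.*)\?v=\w+$", r"cdn.example.org/\1"),
    (r"^https?://api\.example\.org/items/(\d+)$", r"api.example.org/items/\1.json"),
]

COMPILED_RULES = [(re.compile(pattern), replacement) for pattern, replacement in REWRITE_RULES]

def resolve(requested_url):
    """Return the ZIM path for a requested URL, or None if no rule matches.
    A reader (kiwix-lib, kiwix-serve, ...) would call this for every request
    coming from the embedded website."""
    for pattern, replacement in COMPILED_RULES:
        if pattern.match(requested_url):
            return pattern.sub(replacement, requested_url)
    return None

print(resolve("https://cdn.example.org/js/app.js?v=abc123"))  # -> cdn.example.org/js/app.js
```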
@ivanistheone, am I right to assume that all of Kolibri's catalog is available in this simple ZIP format? If so, I agree it would be wise to create a (simple, then) kolibri2zim that would bring content from your catalog to ZIM reader users. Is there an API or another way to retrieve those ZIP files' locations?
This link didn't load for me in FF or Chrome:
Wow! It is very impressive that it was able to grab the whole page, including videos, without any human intervention!
Yeah, I see; there are a lot of them, so the fully automated approach makes sense. I was a Doubting Thomas about this approach, thinking it would be too difficult to do, but looking at the kiwix-dev site I'm now Believing Ivan ;)
Not quite. The HTML5Zip files are only used for webpages and interactive content, while audio, video, PDF, and ePub are represented as separate content kinds. You can see an overview of Kolibri content kinds and file types here. A Kolibri channel consists of two pieces:
This way of packaging content into individual, self-contained nodes is what enables subsetting (select only a subset of a channel), remixing (combine content from multiple source channels), and curriculum alignment (organize content according to local educational standards). You can get a full list of all public channels at the endpoint /api/public/v1/channels (a rough sketch of querying it is included after this comment), and this CLI script can be helpful if you want to look around and explore a channel structure:
then the
The idea for ... The reverse transformation ...
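For reference, a minimal sketch of querying the public-channels endpoint mentioned above. The base URL (Kolibri Studio) and the response schema assumed here (a JSON list of channel objects with "id" and "name" fields) are assumptions and may differ from the actual API.

```python
import requests

# Assumption: the public channel listing is served by Kolibri Studio; the exact
# host and the response schema are not confirmed in this thread.
BASE_URL = "https://studio.learningequality.org"

def list_public_channels():
    response = requests.get(f"{BASE_URL}/api/public/v1/channels", timeout=30)
    response.raise_for_status()
    for channel in response.json():
        # Field names ("id", "name") are assumed for illustration only.
        print(channel.get("id"), channel.get("name"))

if __name__ == "__main__":
    list_public_channels()
```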
Ah thanks, odd that it worked for me initially, but here's an updated link.. (helped me find a potential issue!)
There was a little bit of human intervention :) I manually clicked play on the 3 YouTube videos, since I was testing just one page. The automated crawling should handle playing the YouTube videos as well.
Regarding the number of approaches with warc2zim, it seems like @mgautierfr is suggesting 4, @ivanistheone has suggested 3, I have suggested 2, and @kelson42 only wants to support one :)
I think the main issue with too many approaches is going to be the maintenance burden. The SW system is designed to address the most complex use case, but should work for the simpler ones as well. But how do you determine reliably if a site is basic vs advanced vs complex, and what if some pages are more complex than others? Having a separate rewriting system for basic vs advanced vs complex will be a maintenance nightmare.
I only suggest the basic approach so long as there is no additional maintenance: the ZIMs produced are like other ZIMs already in the system, and require no extra rewriting logic to be maintained (or maybe use only what is already in the python-scraperlib). If it can't reuse what's already there, it should just use the full-featured SW system, which I am maintaining already, and I agree with @kelson42.
I find I often need more info about a site I want to scrape before I scrape it, or I end up with something too big or not doable. So I am working on a scanner, basically a rewrite of Ivan's basiccrawler. The idea is to do a requests.head on all URLs on a site, but only download and analyze type text/html; for the others, only type and size information is saved (a rough sketch of this idea follows below). I had thought of trying to classify site complexity as part of this analysis process, and it looks to me like that might be useful in the current context, where some (non-MediaWiki) sites are simple enough for httrack or wget, while others are increasingly complicated, require a headless browser to do additional rendering, and would require WARC extraction in order to download. The output is a number of JSON files that describe site and page structure, though with less detail than a WARC. I think it would be useful to define complexity scenarios and how to measure them, for use in scanning prior to selecting the download strategy.
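Here is a minimal sketch of the kind of scanner described above (not the actual basiccrawler rewrite): HEAD every discovered URL, record type and size, and only download and parse pages served as text/html.

```python
import requests
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href/src attribute values from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def scan(start_url, max_pages=50):
    """HEAD every discovered URL; only download and parse text/html responses."""
    seen, queue, report = set(), [start_url], {}
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        head = requests.head(url, allow_redirects=True, timeout=10)
        content_type = head.headers.get("Content-Type", "")
        report[url] = {"type": content_type, "size": head.headers.get("Content-Length")}
        if "text/html" in content_type:
            page = requests.get(url, timeout=10)
            parser = LinkExtractor()
            parser.feed(page.text)
            queue.extend(urljoin(url, link) for link in parser.links)
    return report
```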
I propose 4 ways to categorise websites, but only 1 solution: the third one. (The first and second categories use only a part of the third solution: a mapping without regexes, or no mapping at all.)
I don't think the implementation should be different and try to categorise the website. The scraper should always do the same things:
@mgautierfr I think our main point of disagreement is where the rewriting is happening.
Yes, there are key points, and what I am saying is that this should not be in the content. Besides using the service worker, the rewriting system can also be implemented in Python, and the service worker code can essentially run in NodeJS as well. So you have at least two other options: run an embedded NodeJS or Python on the server, which is possible to do on both Android and iOS, and then you don't need a service worker. Note that service workers do work on iOS, but only in Safari, not in any other embedded browser (blame Apple!).
On the other hand, if you want to build a custom rewriting system that is incompatible with how wabac.js or pywb work, it is of course possible, but you'll need to maintain it on your own, and on each platform. I think that will increase the maintenance burden for Kiwix considerably.
I disagree with that :)
In ZIM/Kiwix we used to do the rewriting when scraping, not when replaying (and actually, there is no concept of replaying; we are just reading an archive).
Yes, this is silly. But we already have that in other situations. A wrong ZIM file is a wrong ZIM file; it is not up to the reader to work around that. One main concept is that readers are not dependent on the ZIM content. ZIM content follows a "well-known structure" and readers know only about this structure. If there is a bug in the reader or a feature is missing, we have no choice: we must update the reader. But I don't want the reader and the content to be coupled.
If some bug is found, it is a faulty ZIM file. If recrawling is not possible, we can recreate a new ZIM file from an old one and fix the bug at that moment. I really doubt that a new version of a browser would break a rewriting. And I have no numbers, but I think that most of our users actually don't use a browser to read a ZIM file (but native readers instead).
This is especially what I don't want. I don't want to enforce a technology. Everything in ZIM/Kiwix is about data. Everything is specified (or should be) and users are free to use the technology/implementation they want. Of course, we could use Python/Node to implement that. Then we would have to change our iOS application, and then our desktop one, and the server, and... iOS has no service worker, kiwix-js has no C++, kiwix-serve probably needs a service worker, kiwix-desktop may have a service worker but we would need to change all our request handling. If we don't want to enforce a technology (a piece of algorithm implemented in one language), we need to do it the good old way: specify things (and so think about what we want to handle) as data, and let readers implement them the way they want (and we can provide a default implementation, as we do with libzim).
We were thinking that putting a web replayer in a ZIM file was enough and could work with the "well-known structure". But it appears that was a mistake and we actually need to extend the whole system. We are doing it right now, by understanding what the specs are, and specifying what kind of data we want to handle and how. So my main questions for now are: Am I missing something with the URL rewriting? Can all rewrite rules be specified using regexes?
@mgautierfr I think you are misunderstanding the scope of the problem..
This approach works when you are reading a simple, static site, but we are talking about dynamic websites, whose replay must emulate the complexity of an HTTP server.
Yes! That is correct: if you want to archive any generic web page, including pages using the latest features, you may need to update the reader for the page to work!
For example, in archiving https://flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/1.3/primary/lesson/point-notation-and-function-notation-pcalc, I found a small edge case that required updating the reader. The WARC I created was fine, but wabac.js needed a small, generic update. This is why WARC files generally only store the raw data; the 'rules' comprise a complex implementation and many different types of rewriting.
Unfortunately, it just doesn't work that way. This is like saying: here is some HTML/CSS/JS, you are now free to write your own browser to render it! The complexity of replaying any website, which as I understand was the goal of the zimit project here, requires a complex system that can emulate various properties of an HTTP server and manipulate the client-side JS environment. Such a system is necessary to capture and replay, for example, an arbitrary page with embedded YouTube videos, among many other examples. This is just not a problem that can be solved with regexes, given the additional issues I've mentioned above. I think there are really two choices: the generic approach as implemented here to replay dynamic websites, which requires a complex replay system (not just reading a ZIM file), or converting dynamic sites to static ones, on a case-by-case basis, as you (and Kolibri) have already been doing. Or you can start something new, but I think it will still end up going in one of the two directions in the end.
Please help me understand the scope instead of telling me I'm wrong and that we must use wabac.js. Can you provide an example of URL rewriting that cannot be handled by a regex? (Regexes with backreferences can be really powerful.)
What was the edge case? What change did you make?
That is why standards exist. They can evolve over time with new features, but we must not make specific readers.
Just very quickly, the wabac.js system does the following:
Of course, not all of these approaches are needed for every site, but to have a generic system that works for as many sites as possible, this is what's needed. Yes, regexes are used throughout when actual rewriting is needed, but they alone are insufficient to address all of these use cases for replaying complex websites. At this point, this issue seems to be too broad, and maybe a new issue can be opened to address specific questions/suggestions?
Note: this discusses post-1.0 warc2zim possibilities.
With the current warc2zim approach, the overall scraping process can be summarized as:
The main advantage of this is that the result is very similar to the online experience.
The drawback is that we are no longer dealing with a ZIM file but with a WARC file stored inside a ZIM file. This is completely dependent on the Service Worker technology, which is not available everywhere (Qt, Apple).
Also, making it work requires some important changes to our architecture, and some bugs can only be worked around. The main design issue is the confusion between reader and content.
The accumulation of tickets related to making warc2zim's SW work with our stack led to the discussion that it may be easier/smarter to adapt w2z to the ZIM toolchain and not the other way around.
The goal is not to avoid modifying our stack but to modify it in the way that makes the most sense and is maintainable.
It's important that we all understand and document actual requirements behind features so that we can properly discuss alternatives or implement the required tickets on the libs.
Requirements
Alternative
Based on the requirements, we can imagine the following:
scraping
libzim
libkiwix/readers
This would remove the need for an in-ZIM service worker.
Side requirement
The above list of changes requires readers to catch every request, which is fine (although it requires changes) for our readers, but not for kiwix-serve.
The solution would be (as already suggested) to transform kiwix-serve into a rich web-based reader.
This could be implemented in a similar manner to what warc2zim did: an outer UI with the actual, unmodified content in an iframe, and a kiwix-serve Service Worker to handle the requests.
Of course, that kiwix-serve reader would face the same SW issues with browsers (only Safari on iOS, no FF private mode, HTTP, etc.), but the reader could easily detect that and simply disable SW-requiring ZIMs if there's no SW support, while keeping the same UI.
Not sure how a rich-client kiwix-serve and kiwix-js would benefit from each other.
@mgautierfr @kelson42 @ikreymer