Discuss future implementation #48

Closed
rgaudin opened this issue Aug 13, 2020 · 25 comments
Labels
question (Further information is requested), Service Worker, stale

Comments

@rgaudin
Member

rgaudin commented Aug 13, 2020

Note: this discusses post-1.0 warc2zim possibilities.


With the current warc2zim approach, the overall scraping process can be summarized as:

  1. Scrape the content into a WARC file (a collection of request/response headers and payloads).
  2. Iterate over the WARC entries to store headers and payload into separate ZIM articles.
  3. The reader serves a SW-based replayer that answers matching requests by sending headers and payload together.
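
For illustration, step 2 can be sketched with the warcio library; the add_to_zim callback below is a hypothetical stand-in for the ZIM creator, so treat this as an outline of the idea rather than the actual warc2zim code:

from warcio.archiveiterator import ArchiveIterator

def split_warc_records(warc_path, add_to_zim):
    # Iterate the WARC and hand response headers and payload to a ZIM writer.
    # add_to_zim(path, data) is a placeholder, not an actual libzim/warc2zim call.
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            headers = record.http_headers.to_str()      # serialized response headers
            payload = record.content_stream().read()    # raw body
            # store headers and payload as two separate ZIM articles,
            # mirroring the H/ and A/ split mentioned later in this thread
            add_to_zim("H/" + url, headers.encode())
            add_to_zim("A/" + url, payload)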

The main advantage of this is that the result is very similar to the online experience.

The drawback is that we are no longer dealing with a ZIM file but with a WARC file stored inside a ZIM file. This is completely dependent on Service Worker technology, which is not available everywhere (Qt, Apple).
Making it work also requires important changes to our architecture, and some bugs can only be worked around. The main design issue is the confusion between reader and content.

The accumulation of tickets related to making warc2zim's SW work with our stack led to the idea that it may be easier/smarter to adapt w2z to the ZIM toolchain rather than the other way around.

The goal is not to avoid modifying our stack but to modify it in the way that makes the most sense and is maintainable.

It's important that we all understand and document the actual requirements behind features so that we can properly discuss alternatives or implement the required tickets on the libs.

Requirements

  1. Some requests (AJAX) need specific headers on the response to behave properly. needs clarification
  2. Requests can be initiated by JS code. Those can't be found and rewritten at ZIM creation time.

⚠️ PLEASE add or correct other requirements here.

Alternative

Based on the requirements, we can imagine the following:

scraping

  • Keep using the WARC toolchain to produce WARC files.
  • Store both the response headers and the payload in the ZIM (not in a single article's payload). Request headers are not required (!?)
  • A metadata entry or tag indicates that a ZIM requires full-URL access to content.

libzim

  • Can store article headers as part of some metadata

libkiwix/readers

  • Readers include the stored headers in responses to requests.
  • Readers catch all requests, look up non-relative URLs in the ZIM, and return responses accordingly.

This would remove the need for an in-ZIM service worker.
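
As a rough illustration of the proposal, a reader-side handler could look like the sketch below; zim.get() is an assumed lookup returning (headers, payload) and does not correspond to an existing libzim/libkiwix call:

from urllib.parse import urlsplit

def handle_request(zim, request_url):
    # Hypothetical reader-side handler: catch every request, resolve
    # non-relative URLs against the ZIM, and replay the stored headers.
    parts = urlsplit(request_url)
    if parts.scheme in ("http", "https"):
        # non-relative request: look it up by host + path (+ query)
        path = parts.netloc + parts.path
        if parts.query:
            path += "?" + parts.query
    else:
        path = request_url.lstrip("/")
    entry = zim.get(path)            # assumed API, returns None if absent
    if entry is None:
        return 404, {}, b""
    headers, payload = entry
    return 200, headers, payload     # stored response headers are sent along with the payload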

Side requirement

The above list of changes requires readers to catch every request, which is fine (although it requires changes) for our readers but not for kiwix-serve.
The solution would be (as already suggested) to transform kiwix-serve into a rich web-based reader.
This could be implemented in a similar manner to what warc2zim did: an outer UI with the actual, unmodified content in an iframe and a kiwix-serve Service Worker to handle the requests.

Of course, that kiwix-serve reader would face the same SW issues with browsers (only Safari on iOS, no FF private mode, HTTP, etc.), but the reader could easily detect missing SW support and simply disable SW-requiring ZIMs while keeping the same UI.

Not sure how a rich-client kiwix-serve and kiwix-js would benefit from each other.

@mgautierfr @kelson42 @ikreymer

rgaudin added the question and Service Worker labels on Aug 13, 2020
@mgautierfr
Contributor

I disagree with both approaches.

  • The first one, as you said, is having a warc archive in a zim file.
  • The second is to transform our zim readers into (somehow) warc readers.

The ZIM format is a specific format with a specific way of working and specific requirements. Content in ZIM format must comply with this. Period.

Some of those requirements are:

  • Links in the HTML/CSS/… are relative to the local (current) path.
  • Content in the ZIM file is directly usable: no modification is needed before sending it to the web client (embedded webview or real browser).
  • The path/URL of an entry is the way to access and identify it.

Those requirements were somehow never really written down, and they are more Kiwix requirements than ZIM ones. But anyway, we want to access the content using Kiwix tools.

warc2zim should do what its name implies: transform content using the WARC format into content using the ZIM/Kiwix format.
It is the wrong way around to put WARC content in a ZIM and ask the readers to adapt.
If we want to use a format different from ZIM... we should use something other than ZIM.


From what I understand, the WARC format is a format to store a web navigation session (a browser/scraper navigating a web site).
It records the requests sent and the answers returned in a series of records.
Records correspond to what has been exchanged, raw.

  • If the scraper requests a resource and the answer is "304 Not modified", then two records are stored in the WARC.
  • If the scraper gets a whole file using X partial requests, then 2×X records are stored in the WARC.
  • ...

It is nice because the scraper doesn't have to be smart (at least on this part): it just puts what it has done in the WARC.

The player (we use https://github.com/webrecorder/wabac.js, and especially the kiwix branch) is the smart part that puts everything together.

It acts as a proxy: it intercepts the request from the browser, gets the headers and data corresponding to the request from the Kiwix provider (kiwix-serve, or the hook providing content in native applications), and sends back a response to the browser with the recorded response/content.

Note that request records are not in the created ZIM file. I don't know if warc2zim removes them or if they are simply not in the WARC.

On the opposite side, the ZIM format stores plain content and the Kiwix player acts as a real server:

  • It handles caching/not-modified as it wants (kiwix-serve makes the browser cache it; a native app may use a custom cache, or not).
  • It handles partial requests. For video, for example, kiwix-serve understands range headers, while kiwix-android creates a specific video player and feeds it the content directly, totally bypassing the web interface.

warc2zim should be the smart part and transform the WARC format into the ZIM format (following the Kiwix specification):

  • For each 200 record, simply put the content in a ZIM item.
  • Transform 302 records into ZIM redirects.
  • Merge all partial responses into a single content and create one item for it.
  • Ensure that internal links are OK.
  • ...

Then it would introduce no change at all on the Kiwix side and it would work on all our platforms.
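
As a rough sketch of this status-code-driven conversion, reading the WARC with warcio: writer.add_item and writer.add_redirect below are hypothetical, not an existing libzim API.

from warcio.archiveiterator import ArchiveIterator

def warc_to_plain_zim(warc_path, writer):
    # writer.add_item / writer.add_redirect are hypothetical placeholders.
    partials = {}  # url -> list of (range_start, chunk)
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            if status == "200":
                writer.add_item(url, record.content_stream().read())
            elif status in ("301", "302"):
                writer.add_redirect(url, record.http_headers.get_header("Location"))
            elif status == "206":
                # "Content-Range: bytes 0-999/5000" -> start offset 0
                byte_range = record.http_headers.get_header("Content-Range").split(" ")[1]
                start = int(byte_range.split("-")[0])
                partials.setdefault(url, []).append((start, record.content_stream().read()))
            # 304 responses and request records carry no new content and are skipped
    # merge partial responses into a single item per URL
    for url, chunks in partials.items():
        writer.add_item(url, b"".join(chunk for _, chunk in sorted(chunks)))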

@rgaudin
Member Author

rgaudin commented Aug 13, 2020

What you propose offers no solution to the dynamic requests problem. Having the response in the ZIM is not the problem, as those are in the WARC already.

The problem is that we can't rewrite JS code that would lead to a request for https://something.outside.zim/path and thus, when reading the ZIM, the browser will try to access those links without going through the ZIM.

@mgautierfr
Contributor

The problem is that we can't rewrite JS code that would lead to a request for something.outside.zim/path and thus, when reading the ZIM, the browser will try to access those links without going through the ZIM.

That is the real spec; we should have started from that :)

What we need is to store a mapping from absolute URL addresses (which we cannot change in the JS) to resource names in the ZIM file.

Once we have this absolute URL mapping, it would be up to the readers to use it as they want:

  • For native readers, it will be pretty simple: catch all requests from the webview and serve the content if the URL is in the mapping. On kiwix-desktop it would be less than 50 lines, and probably the same on Android/iOS.
  • kiwix-js would have to extend its service worker to also handle absolute URLs (not a big change).
  • On kiwix-serve it will be a bit more complex, as we want to catch the request on the browser side before it goes out. We probably need a service worker, but it would be a pretty simple one, provided by kiwix-serve itself.

Where to store this mapping?

The simplest is probably a specific metadata entry. Readers would have to read and parse the metadata to handle it (kiwix-lib/libzim can help here).

How to build this mapping?

Scrapers already do it and store this information in a WARC file; let warc2zim use this information.
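
A possible sketch of that step, using warcio to read the WARC; the in-ZIM path choice and the metadata name are illustrative assumptions, not anything defined by libzim today:

import json
from urllib.parse import urlsplit
from warcio.archiveiterator import ArchiveIterator

def build_url_mapping(warc_path):
    # Collect "absolute URL -> in-ZIM path" pairs from the captured responses.
    mapping = {}
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            parts = urlsplit(url)
            # example path scheme: host + path, query string kept as-is
            mapping[url] = parts.netloc + parts.path + ("?" + parts.query if parts.query else "")
    return mapping

# The scraper could then serialize the mapping into a metadata entry, e.g.:
# writer.add_metadata("X-Url-Mapping", json.dumps(build_url_mapping("site.warc.gz")))  # hypothetical call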


Some requests (ajax) needs specific headers on the response to behave properly. needs clarification

We really need clarification.
I'm not sure at all that we should/need to handle that.
A URL is by definition the identifier of a resource. Headers are a way to specify the context in which we get this resource, and the context is constant in a ZIM file.

@rgaudin
Member Author

rgaudin commented Aug 18, 2020

Good. We're on the same page.

Regarding the headers stuff, it's just a hunch based on the fact that WARC stores them but it might not be a requirement. @ikreymer would clear that up.

@ikreymer
Collaborator

ikreymer commented Aug 18, 2020

The WARC file is just a storage format, like ZIM, and is designed to store HTTP request/response data, as well as other arbitrary data. Like ZIM, it can be used in a number of different ways.

ZIM format is a specific format with a specific way of working and requirements. Content in a zim format must comply with this. (dot).

Some of those requirements are :

Links in the html/css/.. are relative to the local (current) path.
content in the zim file is directly usable. No modification needed before sending it to the web client (webview embedded or real browser).
The path/url of the entries are the way to access and identify them.

I understand that the ZIM format was created for MediaWiki, but unfortunately, the rest of the web does not work like MediaWiki. I think this approach works well as long as you are dealing only with static sites with no JavaScript, like MediaWikis.

From what I understand of warc format is that is it a format to store a web navigation (a browser/scrapper navigating on a web site).
It records the request sent and the answers returned in a series of records.
Records correspond to what have been exchange, raw.

If the scrapper requests a resource and the answer is "304 Not modified", then two records are stored in the warc.
If the scrapper get a whole file using X partial requests, then 2*X records are stored in the warc.
...
It is nice because the scrapper doesn't have to be smart (a least on this part). Just put what it has done in the warc.

If only it worked that way... :) Yes, you could store the 206 responses in the WARC (it is allowed), but it is not a good idea because it will be hard to replay. The WARC spec allows storing any request/response of course; how best to do so is up to the capture and replay tools. All Webrecorder tools avoid storing 206 responses, for example, or filter them out, to only serve the full 200 record. This is necessary to make videos work, for example. But this isn't the main issue.

On the opposite side, zim format store plain content and the kiwix player act as a real server :

It handle cache/not modified as it wants (kiwix-serve make the browser cache it, native app may use a custom cache, or not).
It handle partial requests. For video for example, kiwix-serve understand partial headers, kiwix-android create a specific video player and feed it with a content provided, totally bypassing the web interface).

Yes, the wabac.js replay handles range requests as well, though caching is disabled for now.
It's designed to work the same on all platforms, though.

The ideal solution is to store the raw data as much as possible, and rewrite as needed in the player.
Why? Because the rewriting requirements may change: as browsers change, new workarounds are needed.
If the data is changed as it's written to the ZIM, then the original is lost, and if new changes are needed it is harder to fix again.
It's not so much about archival integrity as about being able to replay the archived content in new browsers, environments, etc... Saving a particular 206 or 304 response is not especially important for most use cases, as no one will see those!

What we need, is to store a mapping absolute url address (we cannot change in js) -> resource name in zim file.

Once we have this absolute url mapping, it would be up to the readers to use it as they want :

For native readers, it will be pretty simple. Catch all request from the webview and serve the content if the url is in the mapping. On kiwix-desktop it would be less that 50 lines and probably the same on android/ios.
kiwix-js would have to extend its service worker to also handle absolute url (not a big change)
On kiwix-serve it will be a bit more complex as we want to catch the request on the browser side before it goes out. We probably need a service worker but it would be a pretty simple one, provided by kiwix-serve itself.

This is what's being done now with A/example.com and H/example.com. I'm not sure that this is necessarily the best approach, but it is the one we've chosen so far.

But it's not just about loading exact URLs, and it's not that simple :) There's a lot more involved! For instance, many sites require the window.location to match, and if loaded in a different location, the site breaks. The wabac.js system injects a lot of JavaScript to emulate the original site. Also, many requests are not exact: for example, a URL captured might be https://example.com/?_=123 but on request there is https://example.com/?_=124 and it will not match.
The wabac.js system handles this sort of fuzzy matching as well, which is actually not possible with the current setup.

I think a main question is: what kind of sites do you want to capture?

If most of your sites are like Bouquineux.com, then I agree, you don't need any of this.

But then you have sites like CK12, e.g. https://flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/1.3/primary/lesson/point-notation-and-function-notation-pcalc, which are highly dynamic and include embedded YouTube videos. These are a challenge to capture and replay, and wabac.js is designed to deal with replaying sites like this.
In fact, currently the embedded youtube video doesn't work in the ZIM version due to lack of fuzzy matching.

My guess is that more and more (non-MediaWiki) sites will start looking like CK12 and not like Bouquineux, and I think this project is about being able to support such sites, as I understand it.

I would like to suggest an alternative solution: simply storing the WARCs, or rather a WACZ file, directly in the ZIM.
WACZ is a new format I'm developing that can store everything needed, including the index (and soon even full-text search data), so there can be just one file, A/example.wacz, and the associated viewer files.

To interface with the existing Xapian search, perhaps entries for the searchable pages are made; all they do is redirect to the service worker, as we have now, and are there only to store the text -- the replay will be handled by the system.

I am also thinking about maintenance, and this would be the lowest-maintenance approach, as long as all platforms can support service workers. I think this would avoid having different implementations on different platforms, which I think is going to be a maintenance nightmare. An even better approach would be to support loading wabac.js directly in kiwix-lib.

Unfortunately, it's simply not possible to get sites like CK12 with dynamic content and YouTube videos to work without service workers, or without doing a whole lot of custom rewriting work for each site.

It is up to you to decide if that is important or not for what you're trying to do.

@rgaudin
Member Author

rgaudin commented Aug 18, 2020

Thanks for the feedback @ikreymer.

@ikreymer
Collaborator

Regarding the requirements, I think they are not different from wabac.js's overall, if you want to support any website:

  1. Some requests (AJAX) need specific headers on the response to behave properly. needs clarification
    Yes, HTTP headers may be needed for some AJAX requests on some sites. This is why they are kept: figuring out when they're needed and when they're not is very tricky. The headers make very little difference in size but can make the difference between a site working and not working.
  2. Requests can be initiated by JS code. Those can't be found and rewritten at ZIM creation time.
    Yes, this is another reason why rewriting at ZIM creation time can be prone to errors. Having the replay system in the renderer allows for making adjustments to make sure the data is replayable.

To summarize from above:

  1. Location and origin rewriting: the system rewrites window.location, document.origin and a bunch of other JS properties that retrieve the location. This is done by client-side injection and rewriting, both in the service worker and in the client (for dynamically added scripts).

  2. Fuzzy matching, due to timestamps and other dynamic URLs: for example, https://example.com/?_=123 is captured and https://example.com/?_=124 is queried. wabac.js can handle this by keeping an index in IndexedDB and doing a prefix search for https://example.com/? to find the best match. There are also some domain-specific rules for larger sites, like YouTube, that require occasional updating.

The wabac.js already handles this, but updates will always be needed. The wabac.js sw.js file is <1MB when minimized, but if it needs to change, then an entire ZIM, which may be many GB, will need to be regenerated. This is why the ideal solution is to support the service worker in kiwix-lib; that way only a new version of kiwix-lib is needed (and releases of the other tools, but there is already a toolchain for building all of the dependencies).
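
To make the fuzzy matching in point 2 concrete, here is a much simplified Python stand-in for the IndexedDB prefix search that wabac.js performs (the real implementation also layers per-site rules on top of this):

import bisect

def fuzzy_lookup(captured_urls, requested_url):
    # captured_urls must be sorted; exact hits win, otherwise return the
    # first capture sharing the prefix before the query string.
    if requested_url in captured_urls:
        return requested_url
    prefix = requested_url.split("?", 1)[0] + "?"
    i = bisect.bisect_left(captured_urls, prefix)
    if i < len(captured_urls) and captured_urls[i].startswith(prefix):
        return captured_urls[i]
    return None

captured = sorted(["https://example.com/?_=123"])
print(fuzzy_lookup(captured, "https://example.com/?_=124"))  # -> https://example.com/?_=123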

@ikreymer
Collaborator

On the other end, for sites like bouquineux.com, which are entirely static and have no videos and no JavaScript, it would be easy to add a 'classic zim' mode where warc2zim produces 'standard' ZIMs, e.g. www.bouquineux.com/index.html -> A/index.html, and no service worker is needed. This is entirely doable as well, and may work well for many sites and still be quite useful. It is a bit much to require a service worker for a static site like Bouquineux.

I even started prototyping this 'simple' mode initially for warc2zim, though @kelson42 said it was not needed, but it could be done. It will work fine for bouquineux.com and sites like it, and could warn if classic mode will not work for a certain approach.

I think the two options should be:

  1. classic zim for single-domain, no JS sites
  2. service worker mode with service worker in kiwix-lib for more complex sites

@mgautierfr
Contributor

There are a lot of things here. I've tried to group them into sections.


js in website

I understand that the ZIM format was created for MediaWiki, but unfortunately, the rest of the web does not work like MediaWiki. I think this approach works well as long as you are dealing with static sites only with no Javascript, like MediaWikis.

MediaWiki is not a static site; there is JS running in it.
And we can already have YouTube channels (https://download.kiwix.org/zim/other/hygiene-mentale_fr_all_2020-01.zim) in a ZIM file.
The phet ZIM files (https://download.kiwix.org/zim/phet/phet_en_2020-08.zim) are already dynamic websites with a lot of JS.

ZIM (like WARC) replaces the server part. Nothing prevents us from having complex JS running on a website embedded in a ZIM. What we need is to have all the resources in the ZIM so we can provide them to the website in place of the server. This means that we need to know them. That is the scraper's job, but the problem is the same for ZIM and WARC.

All Webrecorder tools avoid storing 206 responses for example, or filter them out, to only serve the full 200 record.

Yes, of course. My main point was that WARC and ZIM were designed differently: ZIM stores resources and WARC records exchanges. It doesn't mean you cannot store curated exchanges (and I hope you do, the same way we store curated resources).

Yes, the replay wabac.js handles range requests as well

And it is potentially a problem.
How is the video stored? As one file (wabac.js "simply" forwarding range requests to the server), or is each range an item (with wabac.js making a full request for each range to kiwix-serve)?

But then you have sites like, CK12, ex, flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/1.3/primary/lesson/point-notation-and-function-notation-pcalc which are highly dynamic, and include embedded youtube videos.

Can you be more specific? JS can run in a site embedded in a ZIM file. The website will make requests; if we answer with the same content for each request made, everything should work.

I understand that the problem is to determine what the real identifier of a content is in a request.
My point is that it is the URL only. The other parts (headers) are contextual (how to get the content, not which content). And yes, the "fake" server MUST handle that. It may work for a while if we have simple websites, but if we have complex sites, we have to handle that. Transforming a complex website into a simple one is not the solution.

If our "servers" have to do it regularly (parse headers, ...) we can move that in the kiwix-lib instead of letting all application implement that.

Also, many requests are not exact, for example, a URL captured might be https://example.com/?_=123 but on request there is https://example.com/?_=124 and it will not match.
The wabac.js system handles this sort of fuzzy matching as well, which actually is not possible with the current setup.

You need to be clearer about that.
I don't understand how it could work on the original website and not in a ZIM file.
If a website makes a https://example.com/?_=123 request and gets content for it, we simply store the content using the path ?_=123 (ignoring the host problem for now). If the embedded site requests the ?_=123 resource, we serve it. We don't care that the original server would actually have returned the "124" content. Why should we rewrite the requested URL at all?

And even if we must rewrite it: if wabac.js can know that it has to do the rewrite, we can also know it at ZIM creation time and handle it there (store the content of https://example.com/?_=124 under the key https://example.com/?_=123).


Adapt to browsers/specific website.

Why? because the rewriting requirements may change, as browsers change, new workarounds are needed.
If the data is changed as its written to ZIM, then the original is lost, and if new changes are needed it is harder to fix again.
It's not so much about archival integrity as about being able to replay the archived content in new browsers, environments, etc...

There are two points here.

  • Requirements of the website itself. If somehow browsers change and are no longer able to browse a specific website, it is not our problem. If the site is updated for the browser, we must do a new "capture" of the website. We will not fix a non-working website.
  • Requirements of our technology with respect to browsers. Yes, we must adapt. But then our piece of technology should be in the "server", not in the content.

If the data is changed as its written to ZIM, then the original is lost, and if new changes are needed it is harder to fix again.

I have never seen a (major) browser break compatibility with old websites.

The wabac.js already handles this, but updates will always be needed.

Why? I understand that wabac.js needs to be updated because websites change. But the content in a ZIM doesn't change.


How we do this.

This is what's being done now with A/example.com and H/example.com.. I'm not sure that this is necessarily the best approach, but one that we've chosen so far.

Yes, we have made choices and they may not be the best. Splitting data between 'A' and 'H' was a decision made on an incomplete understanding of the needs and the technologies involved. By definition that is a bad decision (even if the solution turns out to be the right one).

This is why the ideal solution is to support the service worker in kiwix-lib, that way only a new version of kiwix-lib is needed (and release of other tools, but there is already a tool chain for building all of the dependencies).

I agree that we should handle such things on the client side. But I'm not sure a SW is the solution (because it doesn't work everywhere, and we know it leads to some issues with other features).
wabac.js takes decisions based on information. What I want is to identify what this information is and how to store it coherently with the other information stored in a ZIM file. Then clients will be able to use it.

I am also thinking about maintenance, and this would be the least maintenance approach, as long as all platforms can support service workers. I think this would avoid having different implementations in different platforms, which I think is going to be a maintenance nightmare. An even better approach would be to support loading wabac.js directly in kiwix-lib.

But service workers will not work on iOS (at least). And this is understandable.
A SW is an intermediate server handling requests and providing specific content to a website (in place of the browser making regular requests).
With an embedded webview, there is already the possibility to implement such a proxy server (and we need to implement one anyway). I understand that Apple simply decided to avoid such duplication.


What do we want to support ?

I would like to suggest an alternative solution, simply storing the WARCs, or rather WACZ file directly in the ZIM.
WACZ is a new format I'm developing that can store everything needed, including the index, (and soon even full text search data), so there can be just one file, A/example.wacz and the associated viewer files.

Well, we could also simply make our applications play WARC files directly. No need for ZIM at all in that case.

For instance, many sites require the window.location to match, and if loaded in a different location, the site breaks

Do we really want to support that? If sites require such a thing, it is probably because they don't want to be replayed/embedded. Do we want to fight against them?

Fuzzy matching, due to timestamps, and other dynamic urls, such as https://example.com/?_=123 capture and https://example.com/?_=124 is queried. wabac.js can handle this by keeping an index in IndexedDB and doing a prefix search for https://example.com/? and finding best match. There are also some domain specific rules for larger sites, like youtube, that require occasional updating.

OK, a timestamp is a reason to have something changing. But how many sites use the current time to identify a content? How can they identify one content with a constantly changing key? Or is the content changing all the time?

Or is the timestamp in the URL not part of the identifier? I don't know how to handle that for now. But how many sites do this?

@ivanistheone

ivanistheone commented Aug 19, 2020

Hi All, I'm excited to see the rapid progress on warc2zim. Thank you for holding these discussions in the open, so we can all learn from them.

I want to provide some experience from packaging content for offline use in the "high-touch regime" (not just archiving as-is, but using custom scrapers specific to the source website).

Here is the CK-12 flexbook article discussed above, rendered in Kolibri:
https://kolibri-catalog-en.learningequality.org/en/learn/#/topics/c/8cffcd06241f5123b2de553fe4898984?searchTerm=Point%20Notation%20and%20Function%20Notation
or direct link to download the HTML5Zip file:
https://kolibri-catalog-en.learningequality.org/content/storage/c/a/ca77217a28604814551b536edca857a3.zip

  • The format is a regular zip file that includes index.html in the root and all web assets with all internal links rewritten so they use relative paths referring to assets within the zip file.
  • The Kolibri renderer simply serves the index.html inside the .zip file as if it were a static site.
  • See here for the source code for the scraper+transform steps required to produce this zip file.

The main problems with this "high touch" approach of rewriting of the web content are:

  • expensive = each site requires a different scraper
  • scraping code is brittle (if source website changes, the scraper code will stop working)
  • duplication of work as multiple orgs develop their own scrapers for their specific use case

This is why I'm so excited by the potential for an automated archiver and scraper tool that works on most sites and doesn't require custom code for each site.

@ikreymer said: I think the two options should be:

  1. classic zim for single-domain, no JS sites
  2. service worker mode with service worker in kiwix-lib for more complex sites

Perhaps we could subdivide option 1 into two:

  • 1a. classic zim for single-domain, no-JS sites, no service worker [like wget --convert-links --page-requisites ... | zimwriterfs] (this will work only for sites with no dynamically loaded content)
  • 1b. classic zim + local JS [post-process the WARC files to establish a complete manifest of the resources that were requested, save the resources as regular files in the .zim, try to rewrite references to resources based on the manifest] (this will work for a wider range of websites)
  • 2. service worker mode [full playback capabilities]

I'm not sure how feasible approach 1b would be, but it seems a JS-enabled browser + post-processing could work for A LOT of basic content sites. And for cases where a basic JS-enabled browser + resource-link rewriting doesn't work, option 2 will be available with full replay power.

@rgaudin @mgautierfr Do you have an idea of what websites you want to be able to scrape using warc2zim? Do you need to solve the general problem of scraping arbitrary websites or can you focus on simpler websites with mostly content and only a bit of javascript?

If you have a "to scrape" list in mind that you can share, I'd like to take a look as I wonder if there is any overlap with the channels in the Kolibri Content Library, in which case I can look into creating something like kolibri2zim if you think this will be helpful (I have a script to create static html export from a Kolibri channel, so I just have to learn how to use zimwriterfs).

@ikreymer
Collaborator

Hi @ivanistheone, thanks for sharing your approach as well.

I got that particular example from looking at recent requests here: https://github.com/openzim/zim-requests/issues

The intent of this project has been for Kiwix to avoid having to write custom scrapers/crawlers for these sites.
I think a big question is how common these difficult sites are vs. simpler sites; I don't really know the answer to that.

Thanks for sharing your rendering of this site.

For comparison, here is a ZIM replay of the same page, created with latest warc2zim and using kiwix-serve:
https://kiwix-dev.webrecorder.net/ck12-test/A/flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/2.1/primary/lesson/factoring-review-pcalc

It required no custom work on this site (though I did find a generic bug in wabac.js which prevented the notations from working initially). Unfortunately, the youtube videos do not load with the current setup, due to lack of fuzzy/inexact matching support - the youtube URLs ignore some sort of dynamic hash.

Here is the same page using the full replayweb.page system that I've been working on:
https://replayweb.page/?source=https%3A%2F%2Fdh-preserve.sfo2.cdn.digitaloceanspaces.com%2Fwebarchives%2Fzim%2Fck12.wacz#view=replay&url=https%3A%2F%2Fflexbooks.ck12.org%2Fcbook%2Fck-12-precalculus-concepts-2.0%2Fsection%2F2.1%2Fprimary%2Flesson%2Ffactoring-review-pcalc

This should work with the youtube videos playing, in Chrome and Firefox. In Safari, unfortunately, the videos currently don't play as Safari doesn't support the encoding youtube uses for Chrome and FF, so that's an extra step that'd be needed to capture in a different format probably, or do a conversion (the video is loaded from an mp4).

But, the page and equations should load in all modern browsers which support service workers, including mobile, and requires no server-side infrastructure. (It should work in mobile safari, though not the videos yet, due to the format issue)

The static file is just loaded from:
https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/zim/ck12.wacz (using the new WACZ container which bundles WARCs and indices for fast access).

I think this serves to demonstrate the spectrum of possibilities that we have available:

  • The Kolibri approach, which results in a simple, static page, but lots of custom work per site.
  • The full replayweb.page approach, which yields the highest fidelity (though still some cross-browser issues to resolve), and is also complex to maintain, but is generic.
  • The current warc2zim solution, which is almost like replayweb.page, but must work within the constraints of ZIM and kiwix-serve, and has some limitations (no fuzzy matching).

@kelson42
Contributor

@rgaudin @mgautierfr Do you have an idea of what websites you want to be able to scrape using warc2zim? Do you need to solve the general problem of scraping arbitrary websites or can you focus on simpler websites with mostly content and only a bit of javascript?

@ivanistheone Thank you for participating in this discussion. Here is the list of websites we want to scrape with Zimit (warc2zim): https://github.com/openzim/zim-requests/issues?q=is%3Aopen+is%3Aissue+label%3Azimit. This is only the showcase for the Zimit project; we want to do more afterwards, of course. This project is not about scraping simple/non-JS websites only, and we won't develop a version of zimit/warc2zim only for simple sites. We know perfectly well that there is no way to be sure to scrape all websites properly, but this tool is developed with that vision.

@mgautierfr
Contributor

I see several cases, depending on the complexity of the website we need to scrape:

Basic website (even with JS)

Simply download the resources (HTML/CSS/JS/images/...) and rewrite URLs to be relative.

Advanced website

"Advanced" is about JS that generates URLs (local and external).
We can detect those URLs but we cannot change them. Then a "simple" URL mapping stored in the ZIM file and handled by the client is enough. No need for a SW (at least not one stored in the ZIM file; kiwix-serve will need (and provide) one).

Complex website

URLs are generated by the JS but cannot be detected in advance.
I suppose this is something like https://example.com/video.mp4?id=123456&keyword=foo<generated>
Here the video is identified by https://example.com/video.mp4?id=123456, but the keyword=foo<generated> part is generated dynamically based on the current time, screen resolution, ...
By using specific rules, we can detect this kind of URL so that:

  • when writing, we store the video using the path video.mp4?id=123456 (or whatever the scraper defined);
  • when reading, we remove the unnecessary part and make the request to video.mp4?id=123456.

First of all, such rules are specific to the website, so we are outside the scope of a generic scraper (even if the scraper can integrate these rules in practice).

Then, we can have this specific rule handling implemented as code (in a SW), but we can also have these rules in the form of data (think of the rewrite rules of nginx or Apache, https://www.nginx.com/blog/creating-nginx-rewrite-rules/).
If we can have these rules in the form of data, we don't need a specific SW.
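
As a sketch of what "rules as data" could look like: each rule is just a regex pattern plus a replacement that a reader can apply both when storing and when resolving a request. The rule below is purely illustrative, not a rule shipped anywhere.

import re

# Illustrative rule set: drop the dynamically generated keyword from the
# video URL example above. Each rule is plain data: (pattern, replacement).
REWRITE_RULES = [
    (r"^(https://example\.com/video\.mp4\?id=\d+)&keyword=.*$", r"\1"),
]

def canonicalize(url, rules=REWRITE_RULES):
    # Apply the first matching rule; usable both at write time and at read time.
    for pattern, replacement in rules:
        new_url, count = re.subn(pattern, replacement, url)
        if count:
            return new_url
    return url

print(canonicalize("https://example.com/video.mp4?id=123456&keyword=fooXYZ"))
# -> https://example.com/video.mp4?id=123456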

If this is not possible, we indeed need a specific SW, but:

  • We don't need to have the "complex" web-replay loading page in an iframe. Each HTML article can be modified to load a JS snippet that loads the SW if it is not already present. So the whole Kiwix ecosystem keeps working.
  • It will not work on iOS.
  • It may conflict with the SW provided by kiwix-serve for its own purposes.

Specific website

Well, it is specific: each website will have its own scraper. It will always be necessary to have some specific scrapers.


Is a case missing?
Are my assumptions about the URL rewriting correct? (If not, please provide examples.)
Did I miss something?
(@ivanistheone your link to the source code for the scraper+transform steps seems broken.)

@rgaudin
Member Author

rgaudin commented Aug 20, 2020

@ivanistheone, am I right to assume that all of Kolibri's catalog is available in this simple ZIP format? If so, I agree it would be wise to create a (then simple) kolibri2zim that would bring content from your catalog to ZIM reader users.

Is there an API or another way to retrieve those ZIP files' locations?

@ivanistheone

@ikreymer said:
For comparison, here is a ZIM replay of the same page, created with latest warc2zim and using kiwix-serve: https://kiwix-dev.webrecorder.net/ck12-test/A/flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/2.1/primary/lesson/factoring-review-pcalc

This link didn't load for me in FF or Chrome.

Here is the same page using the full replayweb.page system that I've been working on: https://replayweb.page/?source=https%3A%2F%2Fdh-preserve.sfo2.cdn.digitaloceanspaces.com%2Fwebarchives%2Fzim%2Fck12.wacz#view=replay&url=https%3A%2F%2Fflexbooks.ck12.org%2Fcbook%2Fck-12-precalculus-concepts-2.0%2Fsection%2F2.1%2Fprimary%2Flesson%2Ffactoring-review-pcalc

Wow! It's very impressive that it was able to grab the whole page, including videos, without any human intervention!


@kelson42 said:
https://github.com/openzim/zim-requests/issues?q=is%3Aopen+is%3Aissue+label%3Azimit [...] This project is not about scraping simple/nonjs Web sites only and we won't develop a version of the zimit/warc2zim only for simple sites.
We perfectly know that there is no solution to be sure to scrape properly all web sites, but this tool is develop with this vision.

Yeah, I see, there are a lot of them, so the fully automated approach makes sense. I was a Doubting Thomas about this approach, thinking it would be too difficult to do, but looking at the kiwix-dev site I'm now a Believing Ivan ;)


@rgaudin said: Am I right to assume that all of Kolibry's catalog is available in this simple ZIP format?

Not quite. The HTML5Zip files are only used for webpages and interactive content, while audio, video, PDF, and ePub are represented as separate content kinds. You can see an overview of Kolibri content kinds and file types here.

A Kolibri channel consists of two pieces:

  • An sqlite3 database that contains the channel structure (a tree of TopicNodes (folders) and ContentNodes) and the associated metadata for each node. Nodes are associated with one or more files (e.g. node thumbnail, video file, subtitle files, etc.).
  • A set of files (e.g. storage/a/b/abcdddddddddddddddddddd.ext, where abcdddddddddddddddddddd is the md5 hash of the file contents). The file with ext=ext and md5=abcdddddddddddddddddddd can be downloaded from http://studio.learningequality.org/content/storage/a/b/abcdddddddddddddddddddd.ext
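
For illustration, the storage path just described can be derived like this (a small sketch assuming the layout is exactly as described above):

import hashlib

def kolibri_storage_path(content: bytes, ext: str) -> str:
    # The first two hex characters of the md5 digest become the directory levels.
    digest = hashlib.md5(content).hexdigest()
    return f"storage/{digest[0]}/{digest[1]}/{digest}.{ext}"

# e.g. kolibri_storage_path(data, "zip") yields paths shaped like the
# /content/storage/c/a/ca77217a28604814551b536edca857a3.zip link above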

This way of packaging content into individual, self-contained nodes is what enables subsetting (select only subset of a channel), remixing (combine content from multiple source channels), and curriculum alignment (organize content according to local educational standards).

You can get a full list of all public channels at the endpoint /api/public/v1/channels, and this CLI script can be helpful if you want to look around and explore a channel structure:

virtualenv -p python3 venv
source venv/bin/activate
pip install requests
wget https://gist.githubusercontent.com/ivanistheone/ccc3de4f8b115984565370ec74039b53/raw/3a3f1cf59ed389c6aa638e29739dbe20d44565a1/kolibridb.py
chmod +x kolibridb.py

./kolibridb.py --channel_id 95a52b386f2c485cb97dd60901674a98 --htmlexport

then the reports/ folder will contain:

reports/
├── databases
│   ├── 95a52b386f2c485cb97dd60901674a98.json     # JSON tree representation of the channel
│   └── 95a52b386f2c485cb97dd60901674a98.sqlite3   # the channel DB file
└── kolibrihtmltrees
    └── channel_95a52b386f2c485cb97dd60901674a98_tree.html   # an HTML preview of channel structure

The idea for kolibri2zim would be to use the data from the JSON channel tree representation, download all the required files, then fill in some templates and generate a filesystem folder structure and .html files for each node (like a static site generator). Note that the channel2site repo is very old and superseded by the kolibridb.py script; the only things useful in that repo are the templates.

The reverse transformation zim2kolibri would be more complicated, since we would need to infer a hierarchical tree structure for the contents inside the .zim file in order to build the Kolibri topic tree. If the source website has a nice URL structure like course/1121/unit/31/lesson1.html it would be feasible, but not in the general case for graph-like websites (e.g. Wikipedia). Still, I think it is worth thinking about how we could build something like zim2kolibri in the future, because having round-trip capabilities would allow the following workflow: source zim --zim2kolibri--> source channel --Kolibri_Studio_edits--> remixed channel --kolibri2zim--> remixed .zim, where Kolibri Studio (the channel editor) could be used to select arbitrary subsets of a .zim archive (e.g. if the source .zim is 20GB and contains content in 20 languages, a remixed .zim could be created that contains only one language and is much smaller, ~1GB). @tim-moody has expressed the need for such "subsetting" of .zim files several times, to support contexts where the devices have limited storage.

@ikreymer
Collaborator

@ivanistheone:

This link didn't load for me in FF or Chrome:

Ah thanks, odd that it worked for me initially, but here's an updated link (it helped me find a potential issue!):
https://kiwix-dev.webrecorder.net/ck12test/A/flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/2.1/primary/lesson/factoring-review-pcalc

Here is the same page using the full replayweb.page system that I've been working on: https://replayweb.page/?source=https%3A%2F%2Fdh-preserve.sfo2.cdn.digitaloceanspaces.com%2Fwebarchives%2Fzim%2Fck12.wacz#view=replay&url=https%3A%2F%2Fflexbooks.ck12.org%2Fcbook%2Fck-12-precalculus-concepts-2.0%2Fsection%2F2.1%2Fprimary%2Flesson%2Ffactoring-review-pcalc

Wow! This is very impressive it was able to grab the whole page including videos without any human intervention!

There was a little bit of human intervention :) I manually clicked play on the 3 YouTube videos, since I was testing just one page. The automated crawling should handle playing the YouTube videos as well.

@ikreymer
Collaborator

Regarding the number of approaches with warc2zim, it seems like @mgautierfr is suggesting 4, @ivanistheone has suggested 3, I have suggested 2, and @kelson42 only wants to support one :)

I think the main issue with too many approaches is going to be the maintenance burden. The SW system is designed to address the most complex use case, but it should work for the simpler ones as well. But how do you reliably determine whether a site is basic vs advanced vs complex, and what if some pages are more complex than others? Having a separate rewriting system for basic vs advanced vs complex will be a maintenance nightmare.

I only suggest the basic approach so long as there is no additional maintenance: the ZIMs produced are like other ZIMs already in the system and require no extra rewriting logic to be maintained (or maybe only what is already in python-scraperlib). If it can't reuse what's already there, it should just use the full-featured SW system, which I am maintaining already, and on that I agree with @kelson42.

@tim-moody

I find I often need more info about a site that I want to scrape before I scrape it, or I end up with something too big or not doable. So I am working on a scanner, basically a rewrite of Ivan's basiccrawler. The idea is to do a requests.head on all URLs on a site, but only download and analyze those of type text/html; for the others, only type and size information is saved.

I had thought of trying to classify site complexity as part of this analysis process, and it looks to me like that might be useful in the current context, where some (non-MediaWiki) sites are simple enough for httrack or wget while others are increasingly complicated, require a headless browser to do additional rendering, and would require WARC extraction in order to download. The output is a number of JSON files that describe site and page structure, though with less detail than a WARC.

I think it would be useful to define complexity scenarios and how to measure them for use in scanning prior to selecting the download strategy.

@mgautierfr
Contributor

Regarding the number of approaches with warc2zim, it seems like @mgautierfr is suggesting 4, @ivanistheone has suggested 3, I have suggested 2, and @kelson42 only wants to support one :)

I propose 4 ways to categorize websites, but only 1 solution: the third one. (The first and second categories use only a part of the third solution: a mapping with no regex, or no mapping at all.)
The fourth category is there only to make it explicit and to be sure we agree that we accept specific scrapers for some websites (we will not drop mwoffliner or sotoki).

But how do you determine reliably if a site is basic vs advanced vs complex, and what if some pages are more complex than others? Having a separate rewriting system for basic vs advanced vs complex will be a maintenance nightmare.

I don't think the implementation should be different or try to categorize the website. The scraper should always do the same things:

  • rewrite URLs in HTML and CSS (and JS if possible);
  • for detected URLs we cannot replace, create a mapping;
  • for fuzzy URLs (and we know which websites, because the rules are specific), create rules in the mapping using regexes.

@ikreymer
Collaborator

ikreymer commented Aug 22, 2020

I don't think that implementation should be different and try to categories the website. The scrapper should always do the same things :

  • rewrite url in html and css (and js if possible).
  • for detected url we cannot replace, create a mapping.
  • for fuzzy url (and we know which website because rules are specific), create a rules in the mapping using regex.

@mgautierfr I think our main point of disagreement is where the rewriting happens.
I agree that this needs to happen when replaying, but it does not happen when scraping, and this logic should not be in the ZIM but at the server or service worker level.

There are two points here.

  • requirement of the web site itself. If somehow the browsers change and are not able to browse specific website, it is not our problem. If the site update for the browser, we must do a new "capture" of the site web. We will not fix not working website.
  • requirement of our technology with the browsers. Yes we must adapt. But then our piece of technology should be in the "server", not the content.

Yes, these are key points, and what I am saying is that this should not be in the content.
New versions of browsers may require the rewriting to be updated, or some bug is found, and recrawling an old site is not always possible.
The service worker code is <1MB, but, for example, when a change was needed, I've had to re-upload 10GB of ZIM files.
It is inefficient and costly, and it is impossible to update all the ZIMs. The ZIM should not contain any of this logic; it should come from the server, either via a service worker or via kiwix-lib.

Besides using the service worker, the rewriting system can also be implemented in Python, and the service worker code can essentially run in Node.js as well. So you have at least two other options: run an embedded Node.js or Python on the server, which is possible to do on both Android and iOS, and then you don't need a service worker.

Note that service workers do work on iOS, but only in Safari, not in any other embedded browser (blame Apple!).
Another option would be to just run kiwix-serve and load it in Safari rather than in an embedded browser. This is why replayweb.page itself works fine on iOS as is.

On the other hand, if you want to build a custom rewriting system that is incompatible with how wabac.js or pywb work, it is of course possible, but you'll need to maintain it on your own, on each platform. I think that will increase the maintenance burden for Kiwix considerably.

@mgautierfr
Contributor

I think our main point of disagreement is where the rewriting is happening.

I disagree with that :)

I agree that these needs to happen when replaying, but this does not happen when scrapping and this logic should not be in the ZIM, but in the server or service worker level.

In ZIM/Kiwix we used to do the rewriting when scraping, not when replaying (and actually, there is no concept of replaying; we are just reading an archive).
But, indeed, there are some links we cannot catch when scraping, and so we must catch them when reading.

The service worker code is <1MB, but for example, when a change is needed, I've had to re-upload 10GB of ZIM files

Yes, this is silly. But we already have that situation elsewhere. A wrong ZIM file is a wrong ZIM file; it is not up to the reader to work around that.
The fact that we need to re-download a full ZIM for a small change is a more global issue.

One main concept is that readers are not dependent on the ZIM content. ZIM content follows a "well-known structure" and readers only know about this structure.
If the readers include evolving rules (because of workarounds or handling of new kinds of content), it complicates the system a lot. We move from "you can get a ZIM file any way you want (even by copying it from a USB stick in the middle of nowhere) and read it with the same tool as any other ZIM" to "you may need an always up-to-date reader to read the ZIM file you've just got".

If there is a bug in the reader or a feature is missing, we have no choice: we must update the reader. But I don't want the reader and the content to be coupled.
If a feature is missing to be able to have a new kind of content in ZIM (the case here), we will extend the readers, but above all we will extend the "well-known structure" and be future-proof.

New versions of browsers may require the rewriting to be updated, or some bug is found, and recrawling an old site is not always possible.

If some bug is found, it is a faulty ZIM file. If recrawling is not possible, we can recreate a new ZIM file from the old one and fix the bug at that point.

I really doubt that a new version of a browser would break the rewriting. And I have no numbers, but I think most of our users actually don't use a browser to read a ZIM file (they use native readers instead).

Besides using the service worker, the rewriting system can also be implemented in python and the service worker code can essentially run in NodeJS as well. So you have at least two other options: run an embedded nodejs or python on the server, which is possible to do on both android and iOS, and then you don't need a service worker

This is exactly what I don't want.
I don't want to be dependent on an implementation I don't control (especially in another language).
If we specify the rules as data, readers can do the rewriting however they want: potentially using a Node or Python implementation if they want to, but not because they need to.


I don't want to enforce a technology. Everything in ZIM/Kiwix is about data. Everything is specified (or should be) and users are free to use the technology/implementation they want.

Of course, we could use Python/Node to implement that. Then we would have to change our iOS application, and then our desktop one, and the server, and ...
But that is changing our whole stack to fulfill the needs of WARC, and it will not go that way.

iOS has no service worker, kiwix-js has no C++, kiwix-serve probably needs a service worker, and kiwix-desktop may have a service worker but we would need to change all our request handling.

If we don't want to enforce a technology (a piece of algorithm implemented in one language), we need to do it the good old way: specify things (and so think about what we want to handle) as data and let readers implement them the way they want (and we can provide a default implementation, as we do with libzim).

We thought that putting a web replayer in a ZIM file was enough and could work with the "well-known structure". But it appears that was a mistake and we actually need to extend the whole system. We are doing that right now, by understanding what the specs are and specifying what kind of data we want to handle and how.

So my main questions for now are: am I missing something with the URL rewriting? Can all rewrite rules be specified using regexes?

@ikreymer
Collaborator

ikreymer commented Aug 25, 2020

@mgautierfr I think you are misunderstanding the scope of the problem.

In zim/kiwix we used do the rewritting when scrapping, not replaying (and actually, there is no concept of replaying, we are just reading a archive).

This approach works when you are reading a simple, static site, but we are talking about dynamic websites, where replay must emulate the complexity of an arbitrary HTTP server.

If the readers include evolving rules (because of workarounds or handling of new kind of content) it complicate a lot the system. We move from "You can get a zim file the way you want (even by copying from a usb stick in the middle of nowhere) and you can read it with the same tool than other zim" to "You may need a always up-to-date reader to read the zim file you've just got"

Yes! That is correct: if you want to archive any generic web page, including ones using the latest features, you may need to update the reader for the page to work!

If there is a bug in the reader and a feature missing, we have no choice, we must update the reader. But I don't want the reader and the content being coupled.

For example, in archiving https://flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/1.3/primary/lesson/point-notation-and-function-notation-pcalc, I found a small edge case that required updating the reader. The WARC I created was fine, but wabac.js needed a small, generic update.

This is why WARC files generally store only the raw data; the 'rules' comprise a complex implementation and many different types of rewriting.

I don't want to enforce a technology. Everything in zim/kiwix is about data. Everything is specified (our should be) and user are free to use the technology/implementation they want.

Unfortunately, it just doesn't work that way. This is like saying: here is some HTML/CSS/JS, you are now free to write your own browser to render it! The complexity of replaying any website, which as I understand it was the goal of the zimit project here, requires a complex system that can emulate various properties of an HTTP server and manipulate the client-side JS environment. Such a system is necessary to capture and replay, for example, an arbitrary page with embedded YouTube videos, among many other examples.

This is just not a problem that can be solved with regexes, as I've mentioned additional issues above.
Webrecorder definitely needs better documentation of all of the complexities, and I hope to write this some day...

I think there are really two choices: the generic approach as implemented here to replay dynamic websites, which requires a complex replay system (not just reading a ZIM file), or converting dynamic sites to static ones on a case-by-case basis, as you (and Kolibri) have already been doing. Or you can start something new, but I think it will still end up going in one of those two directions in the end.

@mgautierfr
Contributor

@mgautierfr I think you are misunderstanding the scope of the problem..

Please help me understand the scope instead of telling me I'm wrong and that we must use wabac.js.

Can you provide examples of URL rewriting that cannot be handled by regexes? (Regexes with backreferences can be really powerful.)

For example, in archiving flexbooks.ck12.org/cbook/ck-12-precalculus-concepts-2.0/section/1.3/primary/lesson/point-notation-and-function-notation-pcalc, I found a small edge-case that required updating the reader.. The WARC I created was fine, but the wabac.js needed a small, generic update.

What was the edge case? What change did you make?

Yes! That is correct, if you want to archive any generic web page, including with latest features, you may need to update the reader for the page to work!

That is why standards exist. They can evolve over time with new features, but we must not build content-specific readers.

kelson42 pinned this issue Sep 4, 2020
@stale

stale bot commented Oct 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

The stale bot added the stale label Oct 24, 2020
@ikreymer
Collaborator

@mgautierfr I think you are misunderstanding the scope of the problem..

Please help me understand the scope instead of saying me I'm wrong and that we must use wabac.js

Can you provide example of what url rewriting cannot be handled by regex ? (regex with backreference can be really powerful)

Just very quickly, the wabac.js system does the following:

  • For HTML, a full HTML parser is used and URLs in certain attributes are rewritten. Certain tags require custom handling; for example, <style> tags are rewritten using the CSS rewriter and <script> tags using the JS rewriter.
  • For CSS (including <style> tags), a regex is applied to rewrite url() entries in their various forms.
  • For JS, URLs are not rewritten, but a system is applied to override access to window, document, location, etc., and code is injected to override a whole bunch of DOM functions that deal with URLs; they rewrite URLs only when making a network request.
  • For JSONP, common arguments like jQuery1234_5678 are rewritten to match what is requested, as they may change due to a timestamp (a form of fuzzy matching).
  • For video, DASH/HLS manifests are rewritten to fix a particular resolution.
  • For certain other sites, certain JS is rewritten to disable DASH/HLS when possible.
  • For requests in general, a generic fuzzy matching system is applied, so a URL captured as https://example.com?_=1234 can be matched to a request for https://example.com?_=1235. In the full replayweb.page system there is also a prefix search to find a 'best match' for a given URL, but since that is not possible here, a fuzzy-match redirect is added with its own rules. See Support extra fuzzy matching/canonicalization of certain URLs #64
  • For POST requests, some are converted to GET, and requesting the same URL allows the system to choose the next POST request (also experimental).

Of course, not all of these approaches are needed for every site, but to have a generic system that works for as many sites as possible, this is what's needed. Yes, regexes are used throughout when actual rewriting is needed, but they alone are insufficient to address all of these use cases for replaying complex websites.
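
To give a concrete feel for just one item on that list, the CSS case alone looks roughly like the sketch below; the regex is a simplified approximation, not the actual wabac.js/pywb code:

import re

CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")

def rewrite_css(css_text, rewrite_url):
    # Rewrite url(...) references in a stylesheet; rewrite_url maps an
    # original URL to its replay URL.
    def repl(match):
        quote, url = match.group(1), match.group(2)
        return "url({q}{u}{q})".format(q=quote, u=rewrite_url(url))
    return CSS_URL_RE.sub(repl, css_text)

print(rewrite_css('body { background: url("https://example.com/bg.png"); }',
                  lambda u: "/A/" + u.split("://", 1)[-1]))
# -> body { background: url("/A/example.com/bg.png"); }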

At this point, this issue seems to be too broad, and maybe a new issue can be opened to address specific questions/suggestions?

rgaudin closed this as completed Nov 4, 2020
rgaudin unpinned this issue Nov 4, 2020