Cross-browser Integration test related to warc2zim and wombat #101
Comments
@wsdookadr thank you for this. I agree that, given how dependent warc2zim is on other projects (and zimit even more so), it would be very useful to test the whole solution and catch regressions early. Thanks for the WARC file; I haven't tested it yet, but could you list its contents and provide the commands you used to create it? We certainly don't want such a test to depend on a non-editable, non-reproducible WARC (we already have one in the current tests and it's an issue). I will need to test your proposal to work out whether we can and/or should do this for the whole zimit solution (i.e. generate the WARC with it) or just for warc2zim. Are you interested in building and contributing this test? I know I won't have time to work on this for a little while.
contents of r1.zim (to generate see below)
I've used browsertrix-crawler version
Firefox sometimes hangs indefinitely on loading some of the resources, sometimes it fails to load them, whereas Chromium loads everything immediately.
Yes, I'll follow up on this.
I think integration testing is definitely a good idea, perhaps via zimit directly. But I was not able to repro the particular issue referenced in webrecorder/wombat#93, either with WARC, WACZ or ZIM. As mentioned in the issue, I think the confusion unfortunately stems from Firefox not displaying service worker network traffic, while Chromium does (it looks like an unfixed issue from six years ago: https://bugzilla.mozilla.org/show_bug.cgi?id=1267119). This makes debugging loading issues in Firefox that much trickier. Here's a zim file I tested with, created from the attached WARC above via For me, the
Taking a quick further look at that WARC, there are some 404 captures, for example:
This archived 404 page is actually added to the ZIM, but since ZIM doesn't have a concept of 404, we just serve the generic 404 error for it. This may be confusing, since just doing a zimdump will show that the ZIM contains
In my custom implementation of Zimit URL translation (for scenarios that can't yet use Service Workers), I have to work around the fact that many apparent "assets" are in fact served as 404 or similar pages with an HTML MIME type, because for whatever reason the asset wasn't found, was excluded, or was a redirect. Since we are very dependent on MIME types in our JavaScript reader (Kiwix JS variants), this is one of the significant challenges of reading these ZIM archives. I have to deal with cases where we're trying to load In fact the OpenZIM format does have a
URI encoding issues have been fixed, so I do not expect to have a repro in Zimit2 / Warc2zim2, and there is no clear mention of a specific website that fails to convert into a ZIM, so I can't test either. The core of this issue (the idea of a fully automated test suite comparing results across browsers) is interesting but clearly not something we will have resources to invest in over the coming months or years, so I'm removing any milestone for now.
I found a particular case where wombat works one way on Firefox and a different way on Chromium (tl;dr: Firefox issues an incorrect URI request for a resource while Chromium issues the correct one; the difference is related to URI encoding).
warc2zim has wombat as a dependency, but both are needed to reproduce the entire scenario.
One way to improve both would be to write an integration test for warc2zim, so that the two are exercised together.
The following steps should work as-is on Chromium. With some changes they will work on Firefox too (Firefox needs the proxy to be set in the user profile instead of being passed as a CLI argument).
Here is the set of steps below:
~/.local/bin/mitmdump -m socks5 -p 9001 -s ./har_dump.py --set hardump=./dump.har
(using har_dump.py, tested and worked with mitmproxy 8.1.1 installed from PyPI)
CHROMIUM_FLAGS="--disk-cache-dir=/dev/null --disk-cache-size=1" chromium --ignore-certificate-errors --proxy-server="socks5://localhost:9001" "http://localhost:8083/big_2022-07/A/abstract.ups.edu/aata/section-method-of-repeated-squares.html"
So far this will load the web page from the ZIM in the browser and request all needed resources. It drops any SSL checks and disables caches, making sure that all resources involved are actually loaded.
The test itself would check for the following properties:
For a request in the HAR whose URL contains the prefix /A/mp_ or /A/cs_ (or any other similar prefix), if the part after this prefix (the actual URI) is available in the original WARC, then the request should also be successful in the HAR (because the data is known to be available), that is, with status code 2xx or 3xx but not 4xx.
Example: let's say the WARC contains https://www.google.com/cse/static/style/look/v4/default.css. Then in both HAR files we should see requests with this suffix, and they should all be successful.
Maybe all of this could be done in the wombat test suite, but this is meant as an integration test to make sure all the components work together as expected.
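A minimal sketch of the property check described above, in Python. It is not part of warc2zim or wombat. The function name `failed_known_requests`, the exact URL layout matched by `PREFIX_RE` (a slash after `mp_` / `cs_`, followed by the unencoded original URI), and the idea of passing the WARC URIs as a plain set are all assumptions made for illustration. The HAR access paths (`log.entries[].request.url` and `response.status`) follow the standard HAR structure.

```python
import re

# Assumed URL layout: ".../A/mp_/<original-uri>" or ".../A/cs_/<original-uri>".
# Other similar prefixes would need to be added to this pattern.
PREFIX_RE = re.compile(r"/A/(?:mp_|cs_)/(?P<uri>.+)$")


def failed_known_requests(har, warc_uris):
    """Return (url, status) pairs for HAR requests that should have succeeded.

    A request is reported when its original URI is present in the WARC
    but the recorded response status is not 2xx or 3xx.
    """
    failures = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        match = PREFIX_RE.search(url)
        if match is None:
            continue  # not a rewritten resource request
        if match.group("uri") not in warc_uris:
            continue  # the WARC never captured this URI, so no expectation
        status = entry["response"]["status"]
        if not 200 <= status < 400:
            failures.append((url, status))
    return failures
```

Running this once on the HAR recorded with Chromium and once on the HAR recorded with Firefox, against the same set of WARC URIs, would give one failure list per browser; the test would pass only if both lists are empty.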
Note: Another example of different outcomes in Chrome compared to Firefox is available here.
Test warc file:
174.warc.gz