Cross-browser Integration test related to warc2zim and wombat #101

Open
wsdookadr opened this issue Jul 22, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@wsdookadr
Contributor

wsdookadr commented Jul 22, 2022

I found a case where wombat behaves differently on Firefox and on Chromium (tl;dr: Firefox requests an incorrect URI for a resource while Chromium requests the correct one; the difference is related to URI encoding).

warc2zim has wombat as a dependency, but both are needed to reproduce the entire scenario.
One way to improve both would be to write an integration test for warc2zim that exercises the two together.

The following steps should work as-is on Chromium. Firefox needs some changes, because the proxy server has to be configured in the user profile instead of being passed as a CLI argument.
The steps are:

  • a WARC is fed into warc2zim (I'll provide one) and a ZIM is produced
  • kiwix-serve is run to serve requests with data from the ZIM
  • an intercepting server is started: ~/.local/bin/mitmdump -m socks5 -p 9001 -s ./har_dump.py --set hardump=./dump.har (using har_dump.py; tested with mitmproxy 8.1.1 installed from PyPI)
  • a container with Chromium is started, with the following parameters: CHROMIUM_FLAGS="--disk-cache-dir=/dev/null --disk-cache-size=1" chromium --ignore-certificate-errors --proxy-server="socks5://localhost:9001" "http://localhost:8083/big_2022-07/A/abstract.ups.edu/aata/section-method-of-repeated-squares.html". This loads the web page from the ZIM in the browser and requests all needed resources. It ignores certificate errors and disables caching, so that every resource involved is actually requested
  • After a number of seconds or some other more appropriate stop condition, both the intercepting server and the browser will be stopped. This will give mitmdump the chance to flush the HAR data to disk
  • All status codes and URIs for all resources present in the HAR file will be extracted
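
The last step above could be sketched roughly as follows. This is a minimal sketch, assuming the dump follows the standard HAR layout (`log.entries[]`, each with `request.url` and `response.status`); the function name `summarize_har` and the sample data are hypothetical and only there for illustration.

```python
import json


def summarize_har(har_text):
    """Return a sorted list of unique (url, status) pairs from HAR JSON text."""
    har = json.loads(har_text)
    pairs = set()
    for entry in har["log"]["entries"]:
        pairs.add((entry["request"]["url"], entry["response"]["status"]))
    return sorted(pairs)


# Hypothetical, hand-written HAR content (not taken from a real dump).
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {
                "request": {"url": "http://localhost:8083/big_2022-07/A/index.html"},
                "response": {"status": 200},
            },
            {
                "request": {"url": "http://localhost:8083/big_2022-07/A/missing.css"},
                "response": {"status": 404},
            },
        ]
    }
})

if __name__ == "__main__":
    for url, status in summarize_har(SAMPLE_HAR):
        print(status, url)
```

For a real run, the text would come from the file written by mitmdump (./dump.har in the command above).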

The test itself would check for the following properties:

  1. The requests that were made by Chromium and Firefox (as reflected by the two HAR files generated for both browsers) should be the same, at least in terms of URI and HTTP status code
  2. For any request to /A/mp_ or /A/cs_ (or any other similar prefix), if the part after the prefix (the actual URI) is available in the original WARC, then the request should also be successful in the HAR, because the data is known to be available: status code 2xx or 3xx, not 4xx. Example: if the WARC contains https://www.google.com/cse/static/style/look/v4/default.css, then both HAR files should contain requests with this suffix, and they should all be successful.
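
The two properties above could be checked along these lines. This is only a sketch under assumptions: each browser's HAR has already been reduced to (url, status) pairs, the set of URIs in the WARC is known, and the original URI follows the prefix directly (as in /A/cs_/https://...). The names `PREFIXES`, `original_uri` and `check_properties` are hypothetical.

```python
# Assumed rewrite prefixes; other similar prefixes would need to be added.
PREFIXES = ("/A/mp_/", "/A/cs_/")


def original_uri(request_url):
    """Return the archived URI that follows a rewrite prefix, or None."""
    for prefix in PREFIXES:
        idx = request_url.find(prefix)
        if idx != -1:
            return request_url[idx + len(prefix):]
    return None


def check_properties(chromium, firefox, warc_uris):
    """Return a list of failure messages; an empty list means both properties hold.

    chromium, firefox: iterables of (url, status) pairs, one per browser.
    warc_uris: set of URIs known to be present in the original WARC.
    """
    failures = []
    c, f = set(chromium), set(firefox)

    # Property 1: both browsers made the same requests with the same status.
    for url, status in sorted(c ^ f):
        failures.append(f"browser mismatch: {status} {url}")

    # Property 2: requests for archived URIs must succeed (2xx or 3xx).
    for url, status in sorted(c | f):
        uri = original_uri(url)
        if uri in warc_uris and not 200 <= status < 400:
            failures.append(f"archived but failed: {status} {url}")

    return failures
```

Returning messages instead of raising on the first problem makes it easier to see every discrepancy between the two browsers in one run.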

Maybe all of this could be done in the wombat test suite, but this is meant as an integration test to make sure all the components are working together as expected.

Note: Another example of different outcomes in Chrome compared to Firefox is available here.

Test warc file:

174.warc.gz

@wsdookadr wsdookadr changed the title Proposal: Integration test related to warc2zim and wombat Proposal: Cross-browser Integration test related to warc2zim and wombat Jul 22, 2022
@rgaudin
Member

rgaudin commented Jul 22, 2022

@wsdookadr thank you for this. I agree that given how much we are dependent on other projects in warc2zim (and even more with zimit), it would be a very useful addition to test the whole solution and prevent regressions early.

Thanks for the WARC file; I haven't tested it yet, but could you list its content and provide the commands you used to create it? We certainly don't want such a test to depend on a non-editable/non-reproducible WARC (we already have one in the current tests and it's an issue).

I will need to test your proposal to wrap my head around whether we can and/or should do this for the whole zimit solution (ie. generate the WARC with it) or just for warc2zim.

Are you interested in building and contributing this test? I know I won't have time to work on this for a little while.

@rgaudin rgaudin added the enhancement New feature or request label Jul 22, 2022
@rgaudin rgaudin changed the title Proposal: Cross-browser Integration test related to warc2zim and wombat Cross-browser Integration test related to warc2zim and wombat Jul 22, 2022
@wsdookadr
Contributor Author

wsdookadr commented Jul 22, 2022

@wsdookadr thank you for this. I agree that given how much we are dependent on other projects in warc2zim (and even more with zimit), it would be a very useful addition to test the whole solution and prevent regressions early.

Thanks for the warc file ; haven't tested yet but could you list its content

Contents of r1.zim (see below for how to generate it):
user@garage3:/tmp/sandbox$ ~/zim-bench/math-blogs/deps/zim-tools_linux-x86_64-3.1.1-1/zimdump list ./r1.zim
A/404.html
A/abstract.ups.edu/aata/developer.css
A/abstract.ups.edu/aata/external/cover/cover_aata_2021.png
A/abstract.ups.edu/aata/section-method-of-repeated-squares.html
A/abstract.ups.edu/aata/section-method-of-repeated-squares.ptx
A/c.statcounter.com/t.php?sc_project=10568106&u1=DC95827349C64F77F869F6FD9D0305F5&java=1&security=510ba080&sc_snum=1&sess=a8f3c4&p=0&rcat=d&rdom=d&rdomg=new&bb=1&jg=new&rr=1.1.1.1.1.1.1.1.1&resolution=800&h=600&camefrom=&u=http%3A//abstract.ups.edu/aata/section-method-of-repeated-squares.html&t=AATA%20The%20Method%20of%20Repeated%20Squares&invisible=1&sc_rum_e_s=4266&sc_rum_e_e=4269&sc_rum_f_s=0&sc_rum_f_e=4264&get_config=true
A/cdn.jsdelivr.net/npm/mathjax@3/es5/input/asciimath.js
A/cdn.jsdelivr.net/npm/mathjax@3/es5/input/tex/extensions/amscd.js
A/cdn.jsdelivr.net/npm/mathjax@3/es5/input/tex/extensions/extpfeil.js
A/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff
A/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff
A/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff
A/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff
A/cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js
A/cse.google.com/adsense/search/async-ads.js
A/cse.google.com/cse.js?cx=002637997310187229905:qj2oy0jlpyu
A/fonts.googleapis.com/css?family=Inconsolata:400,700&subset=latin,latin-ext
A/fonts.googleapis.com/css?family=Open+Sans:400,400italic,600,600italic
A/fonts.googleapis.com/css?family=PT+Serif:400,700,400italic,700italic|Open+Sans:400italic,700italic,400,700
A/fonts.gstatic.com/s/opensans/v29/memvYaGs126MiZpBA-UvWbX2vVnXBbObj2OVTS-muw.woff2
A/fonts.gstatic.com/s/ptserif/v17/EJRSQgYoZZY2vCFuvAnt66qSVys.woff2
A/fonts.gstatic.com/s/ptserif/v17/EJRVQgYoZZY2vCFuvAFWzr8.woff2
A/index.html
A/load.js
A/pretextbook.org/css/0.31/banner_default.css
A/pretextbook.org/css/0.31/colors_default.css
A/pretextbook.org/css/0.31/knowls_default.css
A/pretextbook.org/css/0.31/pretext.css
A/pretextbook.org/css/0.31/pretext_add_on.css
A/pretextbook.org/css/0.31/setcolors.css
A/pretextbook.org/css/0.31/style_default.css
A/pretextbook.org/css/0.31/toc_default.css
A/pretextbook.org/js/0.13/pretext.js
A/pretextbook.org/js/0.13/pretext_add_on.js
A/pretextbook.org/js/lib/jquery.espy.min.js
A/pretextbook.org/js/lib/jquery.min.js
A/pretextbook.org/js/lib/jquery.sticky.js
A/pretextbook.org/js/lib/knowl.js
A/pretextbook.org/js/lib/mathjaxknowl3.js
A/sagecell.sagemath.org/embedded_sagecell.js
A/sagecell.sagemath.org/static/spinner.gif
A/sw.js
A/topFrame.html
A/unpkg.com/prismjs@1.22.0/components/prism-core.min.js
A/unpkg.com/prismjs@1.22.0/plugins/autoloader/prism-autoloader.min.js
A/unpkg.com/prismjs@1.22.0/themes/prism.css
A/unpkg.com/prismjs@v1.22.0/components/prism-core.min.js
A/unpkg.com/prismjs@v1.22.0/plugins/autoloader/prism-autoloader.min.js
A/unpkg.com/prismjs@v1.22.0/themes/prism.css
A/www.google-analytics.com/ga.js
A/www.google.com/cse/static/css/v2/clear.png
A/www.google.com/cse/static/element/3e1664f444e6eb06/cse_element__en.js?usqp=CAI%3D
A/www.google.com/cse/static/element/3e1664f444e6eb06/default+en.css
A/www.google.com/cse/static/images/1x/en/branding.png
A/www.google.com/cse/static/style/look/v4/default.css
A/www.mathjax.org/badge/badge.gif
A/www.statcounter.com/counter/counter.js
H/abstract.ups.edu/aata/developer.css
H/abstract.ups.edu/aata/external/cover/cover_aata_2021.png
H/abstract.ups.edu/aata/section-method-of-repeated-squares.html
H/abstract.ups.edu/aata/section-method-of-repeated-squares.ptx
H/c.statcounter.com/t.php?sc_project=10568106&u1=DC95827349C64F77F869F6FD9D0305F5&java=1&security=510ba080&sc_snum=1&sess=a8f3c4&p=0&rcat=d&rdom=d&rdomg=new&bb=1&jg=new&rr=1.1.1.1.1.1.1.1.1&resolution=800&h=600&camefrom=&u=http%3A//abstract.ups.edu/aata/section-method-of-repeated-squares.html&t=AATA%20The%20Method%20of%20Repeated%20Squares&invisible=1&sc_rum_e_s=4266&sc_rum_e_e=4269&sc_rum_f_s=0&sc_rum_f_e=4264&get_config=true
H/cdn.jsdelivr.net/npm/mathjax@3/es5/input/asciimath.js
H/cdn.jsdelivr.net/npm/mathjax@3/es5/input/tex/extensions/amscd.js
H/cdn.jsdelivr.net/npm/mathjax@3/es5/input/tex/extensions/extpfeil.js
H/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff
H/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff
H/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff
H/cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff
H/cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js
H/clients1.google.com/generate_204
H/cse.google.com/adsense/search/async-ads.js
H/cse.google.com/cse.js?cx=002637997310187229905:qj2oy0jlpyu
H/fonts.googleapis.com/css?family=Inconsolata:400,700&subset=latin,latin-ext
H/fonts.googleapis.com/css?family=Open+Sans:400,400italic,600,600italic
H/fonts.googleapis.com/css?family=PT+Serif:400,700,400italic,700italic|Open+Sans:400italic,700italic,400,700
H/fonts.gstatic.com/s/opensans/v29/memvYaGs126MiZpBA-UvWbX2vVnXBbObj2OVTS-muw.woff2
H/fonts.gstatic.com/s/ptserif/v17/EJRSQgYoZZY2vCFuvAnt66qSVys.woff2
H/fonts.gstatic.com/s/ptserif/v17/EJRVQgYoZZY2vCFuvAFWzr8.woff2
H/pretextbook.org/css/0.31/banner_default.css
H/pretextbook.org/css/0.31/colors_default.css
H/pretextbook.org/css/0.31/knowls_default.css
H/pretextbook.org/css/0.31/pretext.css
H/pretextbook.org/css/0.31/pretext_add_on.css
H/pretextbook.org/css/0.31/setcolors.css
H/pretextbook.org/css/0.31/style_default.css
H/pretextbook.org/css/0.31/toc_default.css
H/pretextbook.org/js/0.13/pretext.js
H/pretextbook.org/js/0.13/pretext_add_on.js
H/pretextbook.org/js/lib/jquery.espy.min.js
H/pretextbook.org/js/lib/jquery.min.js
H/pretextbook.org/js/lib/jquery.sticky.js
H/pretextbook.org/js/lib/knowl.js
H/pretextbook.org/js/lib/mathjaxknowl3.js
H/sagecell.sagemath.org/embedded_sagecell.js
H/sagecell.sagemath.org/static/spinner.gif
H/unpkg.com/prismjs@1.22.0/components/prism-core.min.js
H/unpkg.com/prismjs@1.22.0/plugins/autoloader/prism-autoloader.min.js
H/unpkg.com/prismjs@1.22.0/themes/prism.css
H/unpkg.com/prismjs@v1.22.0/components/prism-core.min.js
H/unpkg.com/prismjs@v1.22.0/plugins/autoloader/prism-autoloader.min.js
H/unpkg.com/prismjs@v1.22.0/themes/prism.css
H/www.google-analytics.com/ga.js
H/www.google.com/cse/static/css/v2/clear.png
H/www.google.com/cse/static/element/3e1664f444e6eb06/cse_element__en.js?usqp=CAI%3D
H/www.google.com/cse/static/element/3e1664f444e6eb06/default+en.css
H/www.google.com/cse/static/images/1x/en/branding.png
H/www.google.com/cse/static/style/look/v4/default.css
H/www.googleapis.com/generate_204
H/www.mathjax.org/badge/badge.gif
H/www.statcounter.com/counter/counter.js

and provide the commands you used to create it?

I've used browsertrix-crawler version 0.7.0-beta.1 from this docker image. In my case, instead of --combineWARC=true I've used a script I wrote based on warcio to join the pieces together, but overall it's the same idea and the result should be the same.

docker run --rm=true -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --scopeType any --headless --limit 1 --waitUntil networkidle0 --timeLimit 15 --url "http://abstract.ups.edu/aata/section-method-of-repeated-squares.html" --combineWARC=true --logging pywb,stats  > logs.txt 2>&1
find -maxdepth 4 -name "*.warc.gz" -exec cp {} r1.warc.gz \;
warc2zim --verbose --lang eng --output ./output --name "big" r1.warc.gz
find output/ -name "*.zim" -exec cp {} r1.zim \;
./kiwix-tools_linux-x86_64-3.3.0/kiwix-serve -i 127.0.0.1 -p 8083 ./r1.zim

Firefox sometimes hangs indefinitely on loading some of the resources, sometimes it fails to load them, whereas Chromium loads everything immediately.

Are you interested in building and contributing this test? I know I won't have time to work on this for a little while.

Yes, I'll follow-up on this

@ikreymer
Collaborator

ikreymer commented Aug 1, 2022

I think integration testing is definitely a good idea, perhaps via zimit directly.

But I was not able to repro the particular issue referenced in webrecorder/wombat#93, either with WARC, WACZ or ZIM.

As mentioned in the issue, I think the confusion stems from Firefox not displaying service worker network traffic, while Chromium does (looks like it's an unfixed issue from six years ago: https://bugzilla.mozilla.org/show_bug.cgi?id=1267119). This makes debugging loading issues in Firefox that much trickier, unfortunately.
But in this case, the request does seem to be loaded as normal.

Here's a zim file I tested with, created from the attached WARC above via warc2zim --verbose --lang eng --output ./ --name 174 174.warc.gz (renamed to .zip for attachment).
174_2022-08.zip

For me, the /174_2022-08/A/cs_/https://www.google.com/cse/static/style/look/v4/default.css is loaded correctly in both Chrome and Firefox.

@ikreymer
Collaborator

ikreymer commented Aug 1, 2022

Taking a quick further look at that WARC, there are some 404 captures, for example:

edu,ups,abstract)/aata/developer.css 20220714040352 {"url": "http://abstract.ups.edu/aata/developer.css", "mime": "text/html", "status": "404", "digest": "sha1:VAQZB7CTBQTFVJAKAROCC5YNSZ7UOZ5Y", "length": "626", "offset": "9024", "filename": "174.warc.gz"}

This archived 404 page is actually added to the ZIM, but since ZIM doesn't have a concept of 404, we just serve the generic 404 error for this, though this may be confusing since just doing a zimdump will show that the ZIM contains A/abstract.ups.edu/aata/developer.css and H/abstract.ups.edu/aata/developer.css, which is actually just a 404 page, and A/abstract.ups.edu/aata/developer.css is never served.
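
Captures like this one could be listed up front with a small script. The sketch below assumes index lines shaped like the example above (a URL key, a timestamp, then a JSON object with "url" and "status" fields); the function names are hypothetical.

```python
import json


def parse_index_line(line):
    """Split 'urlkey timestamp {json}' into (urlkey, timestamp, fields)."""
    urlkey, timestamp, payload = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def error_captures(lines):
    """Return (url, status) for captures whose recorded status is 4xx or 5xx."""
    found = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, fields = parse_index_line(line)
        status = fields.get("status", "")
        if status.isdigit() and int(status) >= 400:
            found.append((fields["url"], int(status)))
    return found


# The index line quoted above.
SAMPLE_LINE = (
    'edu,ups,abstract)/aata/developer.css 20220714040352 '
    '{"url": "http://abstract.ups.edu/aata/developer.css", "mime": "text/html", '
    '"status": "404", "digest": "sha1:VAQZB7CTBQTFVJAKAROCC5YNSZ7UOZ5Y", '
    '"length": "626", "offset": "9024", "filename": "174.warc.gz"}'
)

if __name__ == "__main__":
    print(error_captures([SAMPLE_LINE]))
```

An integration test could use such a list to exclude these URIs from the "must be served successfully" check, since the WARC itself recorded them as failures.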

@Jaifroid

Jaifroid commented Sep 9, 2022

@ikreymer wrote:
Taking a quick further look at that WARC, there are some 404 captures ...
This archived 404 page is actually added to the ZIM, but since ZIM doesn't have a concept of 404, we just serve the generic 404 error for this, though this may be confusing since just doing a zimdump will show that the ZIM contains A/abstract.ups.edu/aata/developer.css and H/abstract.ups.edu/aata/developer.css, which is actually just a 404 page, and A/abstract.ups.edu/aata/developer.css is never served.

In my custom implementation of Zimit URL translation (for scenarios that can't yet use Service Workers), I have to work around the fact that there are many apparent "assets" that are in fact served as 404 or similar pages with an html MIME type, because for whatever reason the asset wasn't found or was excluded or was a redirect. Since we are very dependent on MIME types in our JavaScript reader (Kiwix JS variants), this is one of the significant challenges of reading these ZIM archives. I have to deal with cases where we're trying to load image.png (for example), but we get a MIME type from the ZIM of text/html, which may or may not ultimately lead to the image. And we can't rely on the file extension to tell us what we are dealing with (or should be dealing with), as that may not be present for many asset types.
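
As a rough illustration of the mismatch described above (not the Kiwix JS implementation), the sketch below flags entries whose path has a recognizable non-HTML extension while the MIME type reported for them is text/html. It assumes the (path, MIME type) pairs have already been obtained by some other means; the function name is hypothetical, and entries without a usable extension are skipped, which is exactly the limitation mentioned above.

```python
import mimetypes


def suspicious_assets(entries):
    """Return (path, guessed_mime) for entries whose extension suggests a
    non-HTML asset but whose reported MIME type is text/html.

    entries: iterable of (path, reported_mime) pairs.
    """
    flagged = []
    for path, reported_mime in entries:
        clean_path = path.split("?", 1)[0]  # ignore any query string
        guessed, _ = mimetypes.guess_type(clean_path)
        if guessed is None:
            continue  # no usable extension, nothing to compare against
        reported = reported_mime.split(";", 1)[0].strip().lower()
        if reported == "text/html" and guessed != "text/html":
            flagged.append((path, guessed))
    return flagged


# Hypothetical input pairs, using paths from the listing earlier in this thread.
SAMPLE_ENTRIES = [
    ("A/abstract.ups.edu/aata/developer.css", "text/html"),
    ("A/abstract.ups.edu/aata/section-method-of-repeated-squares.html", "text/html"),
    ("A/www.mathjax.org/badge/badge.gif", "image/gif"),
]

if __name__ == "__main__":
    print(suspicious_assets(SAMPLE_ENTRIES))
```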

In fact the OpenZIM format does have a redirect concept (a type of dirEntry), but this isn't used for moved or redirected assets in warc2zim. I can understand why, as it is a low-level thing, but it doesn't surprise me that the inclusion of 404, 301, etc. pages as responses to asset requests could cause problems elsewhere.

@kelson42 kelson42 added this to the 1.6.0 milestone Apr 24, 2023
@benoit74
Collaborator

URI encoding issues have been fixed, so I do not expect to have a repro in Zimit2 / Warc2zim2 and there is no clear mention of a specific website which fails to be converted into a ZIM, so I can't test either.

The core of this issue (the idea of having a fully automated test suite comparing results across browsers) is interesting but clearly not something we will have resources to invest in the coming months / years, so I'm removing any milestone for now.

@benoit74 benoit74 removed this from the 2.0.0 milestone May 13, 2024