-
-
Notifications
You must be signed in to change notification settings - Fork 19
Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #204 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 33 38 +5
Lines 1616 2219 +603
Branches 303 426 +123
==========================================
+ Hits 1616 2219 +603 ☔ View full report in Codecov by Sentry. |
ca31d4a to
cc7c21d
Compare
cc7c21d to
126436e
Compare
dc9e04a to
39ab439
Compare
|
@rgaudin good luck with this review, sorry for that, not sure how you can save time on this big thing ... especially since 90% of the code has already been reviewed in warc2zim ... |
rgaudin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree with your suggestion not to publish to NPM (at least for now), I am surprised that we dont publish it at all on releases.
We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.
If the goal is to build at runtime, then there should be no automatic/trasnparent fallback. Those who cant build it and want to use the fallback should be satisfied with a note in the README.
Beside this question that is not clear, it all looks good to me.
We do release / version / publish it in the wheel, and this might be sufficient from my (very recent and uncertain) PoV. The fallback is more for those who would prefer to use the sdist but don't mind / have sufficient knowledge / understanding of the project to build the JS part (since it is kind of a trick anyway). Let's discuss it live, I agree it is far from simple. |
|
New change:
|
rgaudin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I think the Software Arch and Technical Arch are both rewriting-specific and should be marked as such
rgaudin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Support single Python version
Maintaining multiple Python versions in scraperlib has proved to not bring much values, since we are the main users of the scraperlib, and other contributors starting a new scraper (me few years ago, joseph more recently) are happy to use a recent Python version which is known to be the "best" available.
It is also a pain, due to some things not being available in old Python versions. For instance, here while importing rewriting logic from warc2zim, the
walk_upoptions ofPurePosixPath.relative_tois not available before 3.12 ; and many other semantics were broken in 3.9 and 3.10 (optionals mainly).Old scraperlib supported old Python versions are still here on PyPi should someone still need to use an old Python, like we do on our "old" scrapers. And from experience, when it is time to use recent features of the scraperlib on a scraper, we always take the update opportunity to also update Python, since this is our goal / duty.
Move rewriting stuff from warc2zim to zimscraperlib
Rewriting CSS, JS and HTML is not something specific to warc2zim, we need to do it for libretexts as well for instance. While being definitely way simpler, it is still a shame to duplicate this logic in every scraper. This PR is hence moving most of the rewriting logic here. Most because the border between warc2zim specifics and general code was not always easy to draw. I'm pretty sure we will rework it slightly in the coming months. It is however far sufficient as a first attempt.
It means that we also moves fuzzy rules definitions to scraperlib, which is great (problematic URLs on one scraper will be probablematic on others as well).
It means that we also moves JS code to setup wombat (including dynamic rewriting with fuzzy rules) in the scraperlib.
The CI hence becomes a bit more complex to handle these JS things.
Changes
While most of the code is identical to warc2zim code, I had to change few things, and also took the opportunity to clean / streamline some obscure corners. Here are the changes:
mainbranch (with newPublishDev.yamlworkflow, and use it at python installation timeconsole.*statements in prod build of wombatSetup.js (fixwombatSetup.jsfloods console withurlRewritten:messages warc2zim#404) with @rollup/plugin-stripbuild_js.sh)Tests.yamlWhat is left to decide
Nothing, this has been discussed with rgaudin. Kept here for reference purpose on which decisions have been taken and why.
we could publish wombat-setup to NPM ; I finally see little added value in doing so for now, wombatSetup.js is not meant to be used as a dependency in a JS project yet (can we really imagine a Vue.JS with wombat?) ; CI is anyway already ready and tested, I just commented-it out for now=> we will not publish it to NPM for now ; we will however include wombat.js (license is AGPL 3 which is OK) and wombatSetup.js (our own code) in both the wheel and the sdist to ease installation and ensure versions are aligneddo we want to version wombatSetup.js with semver? I think there is no added value for now, just the build date/time (see first line of https://dev.kiwix.org/zimscraperlib/wombatSetup.js) is sufficient to debug what is going on, and this file is released with the versioned zimscraperlib anyway.=> no, inclusion in the sdist and the wheel ensures that it is "tight" to a zimscraperlib version, probably sufficient given current usageTesting
This PR has been not yet been tested with warc2zim or libretexts, it looks complex enough and it is not a big deal to have a second PR to fix some integration issues should we get into something.
The
Publishjobs have been tested for proper operation with hacks to run them on this branch temporarily and not real publish to their target but save artifacts temporarily. This worked well, so merging to main and then releasing should be relatively smooth.