Skip to content

Conversation

@benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Oct 17, 2024

Support single Python version

Maintaining multiple Python versions in scraperlib has proved to not bring much values, since we are the main users of the scraperlib, and other contributors starting a new scraper (me few years ago, joseph more recently) are happy to use a recent Python version which is known to be the "best" available.

It is also a pain, due to some things not being available in old Python versions. For instance, here while importing rewriting logic from warc2zim, the walk_up options of PurePosixPath.relative_to is not available before 3.12 ; and many other semantics were broken in 3.9 and 3.10 (optionals mainly).

Old scraperlib supported old Python versions are still here on PyPi should someone still need to use an old Python, like we do on our "old" scrapers. And from experience, when it is time to use recent features of the scraperlib on a scraper, we always take the update opportunity to also update Python, since this is our goal / duty.

Move rewriting stuff from warc2zim to zimscraperlib

Rewriting CSS, JS and HTML is not something specific to warc2zim, we need to do it for libretexts as well for instance. While being definitely way simpler, it is still a shame to duplicate this logic in every scraper. This PR is hence moving most of the rewriting logic here. Most because the border between warc2zim specifics and general code was not always easy to draw. I'm pretty sure we will rework it slightly in the coming months. It is however far sufficient as a first attempt.

It means that we also moves fuzzy rules definitions to scraperlib, which is great (problematic URLs on one scraper will be probablematic on others as well).

It means that we also moves JS code to setup wombat (including dynamic rewriting with fuzzy rules) in the scraperlib.

The CI hence becomes a bit more complex to handle these JS things.

Changes

While most of the code is identical to warc2zim code, I had to change few things, and also took the opportunity to clean / streamline some obscure corners. Here are the changes:

  • automatically publish compiled wombatSetup.js to https://dev.kiwix.org/zimscraperlib/wombatSetup.js at every push on main branch (with new PublishDev.yaml workflow, and use it at python installation time
  • automatically add version as comment (+ debug log in dev) in wombatSetup.js with rollup-plugin-version-injector
  • automatically remove console.* statements in prod build of wombatSetup.js (fix wombatSetup.js floods console with urlRewritten: messages warc2zim#404) with @rollup/plugin-strip
  • simplify maintenance of CI with multiple chained jobs (avoid the dirty hack of warc2zim with a custom docker image to run stuff, with custom build_js.sh)
  • properly test build/packaging of JS code in Tests.yaml
  • slightly modify rewriting logic to uncouple few things
  • move tests, add more to ensure 100% coverage and proper testing even without warc2zim specific tests
  • move relevant documentation

What is left to decide

Nothing, this has been discussed with rgaudin. Kept here for reference purpose on which decisions have been taken and why.

  • we could publish wombat-setup to NPM ; I finally see little added value in doing so for now, wombatSetup.js is not meant to be used as a dependency in a JS project yet (can we really imagine a Vue.JS with wombat?) ; CI is anyway already ready and tested, I just commented-it out for now => we will not publish it to NPM for now ; we will however include wombat.js (license is AGPL 3 which is OK) and wombatSetup.js (our own code) in both the wheel and the sdist to ease installation and ensure versions are aligned
  • do we want to version wombatSetup.js with semver? I think there is no added value for now, just the build date/time (see first line of https://dev.kiwix.org/zimscraperlib/wombatSetup.js) is sufficient to debug what is going on, and this file is released with the versioned zimscraperlib anyway. => no, inclusion in the sdist and the wheel ensures that it is "tight" to a zimscraperlib version, probably sufficient given current usage

Testing

This PR has been not yet been tested with warc2zim or libretexts, it looks complex enough and it is not a big deal to have a second PR to fix some integration issues should we get into something.

The Publish jobs have been tested for proper operation with hacks to run them on this branch temporarily and not real publish to their target but save artifacts temporarily. This worked well, so merging to main and then releasing should be relatively smooth.

@benoit74 benoit74 self-assigned this Oct 17, 2024
@codecov
Copy link

codecov bot commented Oct 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (8b6cbac) to head (bed3960).
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #204    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           33        38     +5     
  Lines         1616      2219   +603     
  Branches       303       426   +123     
==========================================
+ Hits          1616      2219   +603     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 force-pushed the content_rewriting branch 12 times, most recently from ca31d4a to cc7c21d Compare October 18, 2024 13:05
@benoit74 benoit74 changed the title Move rewriting stuff from warc2zim to zimscraperlib Support single Python version and move rewriting stuff from warc2zim to zimscraperlib Oct 18, 2024
@benoit74 benoit74 changed the title Support single Python version and move rewriting stuff from warc2zim to zimscraperlib Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib Oct 18, 2024
@benoit74 benoit74 marked this pull request as ready for review October 18, 2024 13:40
@benoit74 benoit74 requested a review from rgaudin October 18, 2024 13:40
@benoit74
Copy link
Collaborator Author

@rgaudin good luck with this review, sorry for that, not sure how you can save time on this big thing ... especially since 90% of the code has already been reviewed in warc2zim ...

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree with your suggestion not to publish to NPM (at least for now), I am surprised that we dont publish it at all on releases.

We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.

If the goal is to build at runtime, then there should be no automatic/trasnparent fallback. Those who cant build it and want to use the fallback should be satisfied with a note in the README.

Beside this question that is not clear, it all looks good to me.

@benoit74
Copy link
Collaborator Author

We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.

We do release / version / publish it in the wheel, and this might be sufficient from my (very recent and uncertain) PoV. The fallback is more for those who would prefer to use the sdist but don't mind / have sufficient knowledge / understanding of the project to build the JS part (since it is kind of a trick anyway).

Let's discuss it live, I agree it is far from simple.

@benoit74
Copy link
Collaborator Author

New change:

  • small fixes linked to conversations (update dev dependencies, associated fixes, ...)
  • move relevant documentation from warc2zim to here
  • include wombat.js, wombatSetup.js and generated rules files in both sdist and wheel (and not just the wheel)

@benoit74 benoit74 requested a review from rgaudin October 21, 2024 09:58
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I think the Software Arch and Technical Arch are both rewriting-specific and should be marked as such

@benoit74 benoit74 requested a review from rgaudin October 21, 2024 12:49
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@benoit74 benoit74 merged commit 570450a into main Oct 22, 2024
@benoit74 benoit74 deleted the content_rewriting branch October 22, 2024 06:58
@benoit74 benoit74 added this to the 5.0.0 milestone Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wombatSetup.js floods console with urlRewritten: messages

3 participants