Rearchitecture MWOffliner HTML/CSS/JS scraping part #1830

Closed
kelson42 opened this issue Apr 12, 2023 · 2 comments · Fixed by #1886
@kelson42
Collaborator

kelson42 commented Apr 12, 2023

MediaWiki provides many API end-points to retrieve HTML/CSS/JS. There are several reasons for this:

  • History (things are evolving, in particular in connection to the wiki rewriting efforts)
  • Wikimedia has its own set of API end-points at https://en.wikipedia.org/api/rest_v1/
  • Mobile or Desktop HTML (they are not the same in MediaWiki)

Depending on the MediaWiki version and how the MediaWiki instance is configured, not all of them are available.

Currently MWoffliner supports a few of them (see #1357 and the source code for more details). But we will need to deprecate some and introduce a few new ones: see #1664 and #1601.

The landscape of MediaWiki API end-points for HTML/CSS/JS is not easy to understand. The other API end-points are more stable and available in all MediaWiki instances (so they are not a problem and not really a topic for this ticket). The only point which is important to understand is that the HTML/CSS/JS end-points let us retrieve, for each article, the HTML and the associated CSS/JS modules (there is a sophisticated module loader called "ResourceLoader").

We retrieve mobile HTML most of the time, but for a few things (or as a fallback) we rely on the Desktop rendering.
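To make the "mobile first, desktop as a fallback" rule concrete, here is a minimal sketch of how the choice could be expressed. The names `Rendering`, `PREFERENCE` and `chooseRendering` are hypothetical and do not come from the MWoffliner code base; how availability is actually detected is left out on purpose.

```typescript
// Hypothetical sketch, not MWoffliner's actual code.
type Rendering = 'mobile' | 'desktop'

// Preference order as described above: mobile first, desktop as a fallback.
const PREFERENCE: readonly Rendering[] = ['mobile', 'desktop']

// Pick the rendering to use, given what the target wiki offers.
// `forceDesktop` stands for the "few things" that need the Desktop rendering.
function chooseRendering(available: readonly Rendering[], forceDesktop = false): Rendering {
  const order: readonly Rendering[] = forceDesktop ? ['desktop'] : PREFERENCE
  for (const rendering of order) {
    if (available.indexOf(rendering) !== -1) {
      return rendering
    }
  }
  throw new Error(`None of the wanted renderings is available: ${order.join(', ')}`)
}
```

With both renderings available the function returns `'mobile'`; if only the Desktop rendering is available it returns `'desktop'`.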

In both cases we retrieve the HTML and run transformations on it to make it directly usable in a ZIM file snapshot (so offline).

The problem is that the part in charge of retrieving the HTML and parsing it is quite a mess:

  • No clear separation between the pieces of code dedicated to specific API end-points
  • No common interface that lets us use a module dealing with end-point number 1 in the same way as a module dealing with end-point number 2 (although they both provide a way to get HTML)
  • A generally weak architecture around Mediawiki and Downloader, which makes any modification complicated

We need to take the code and revamp it into a few clearly identifiable classes (#1357 is a very small part of that job). Once done, it should be easy/clean to implement the support of additional end-points.
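As an illustration of what such a common interface could look like, here is a rough sketch. Everything in it is an assumption made for the sake of the example: the names `HtmlEndpoint`, `restHtml`, `actionParse` and `describeRequest` are hypothetical, the `/w/api.php` path is the Wikimedia default and differs on other installations, and the actual design is of course open for discussion.

```typescript
// Hypothetical sketch, not MWoffliner's actual code.
interface HtmlEndpoint {
  readonly name: string
  // URL to request for a given article title.
  buildUrl(baseUrl: string, title: string): string
  // Extract the article HTML from the raw response body.
  extractHtml(body: string): string
}

// Wikimedia REST API: the response body is the HTML document itself.
const restHtml: HtmlEndpoint = {
  name: 'rest-html',
  buildUrl: (baseUrl, title) => `${baseUrl}/api/rest_v1/page/html/${encodeURIComponent(title)}`,
  extractHtml: (body) => body,
}

// Action API (action=parse): the HTML is wrapped in a JSON envelope.
// Assumes the Wikimedia-style "/w/api.php" script path.
const actionParse: HtmlEndpoint = {
  name: 'action-parse',
  buildUrl: (baseUrl, title) =>
    `${baseUrl}/w/api.php?action=parse&format=json&formatversion=2&prop=text&page=${encodeURIComponent(title)}`,
  extractHtml: (body) => {
    const json = JSON.parse(body)
    if (!json || !json.parse || typeof json.parse.text !== 'string') {
      throw new Error('Unexpected action=parse response')
    }
    return json.parse.text
  },
}

// Callers depend only on the interface, never on a concrete end-point.
function describeRequest(endpoint: HtmlEndpoint, baseUrl: string, title: string): string {
  return `${endpoint.name}: ${endpoint.buildUrl(baseUrl, title)}`
}
```

The point of the sketch is only that end-point number 1 and end-point number 2 are used in exactly the same way: supporting an additional end-point would mean adding one more object implementing `HtmlEndpoint`, without touching the calling code.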

@VadimKovalenkoSNF
Collaborator

I'm going to leave this diagram here as a reference. This is the very first version of how the MW action and REST APIs could be organized at a high level for mwoffliner. Let me know if you expect any improvements/changes here.
mwoffliner-v0

@Dban1

Dban1 commented Aug 29, 2023

#1892 (comment)

In the linked comment I think I have found another possible end-point to query for HTML in the absence of the VisualEditor API.

@DonAlexandro DonAlexandro removed their assignment Sep 7, 2023