Rearchitecture MWOffliner HTML/CSS/JS scraping part #1830

kelson42 · 2023-04-12T13:55:21Z

Mediawiki provides many API end-points to retrieve HTML/CSS/JS. This is because:

History (things are evolving, in particular in connection to the wiki rewriting efforts)
Wikimedia has its own set of API end-points at https://en.wikipedia.org/api/rest_v1/
Mobile or Desktop HTML (they are not the same in Mediawiki)

Depending on the Mediawiki version and how the Mediawiki instance is configured, not all of them are available.
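To make the availability problem concrete, here is a minimal sketch of how candidate end-points could be listed in order of preference and the first available one picked. Every name in it (`EndpointKind`, `buildCandidates`, `selectEndpoint`) is hypothetical and not taken from the MWoffliner code base. The URL paths reflect my understanding of the Wikimedia REST API and the Mediawiki Action API and should be checked against the target wiki. The availability map is assumed to be filled elsewhere, for instance by probing each URL once at startup.

```typescript
// Hypothetical sketch, not MWoffliner's actual code.
type EndpointKind = 'WikimediaMobile' | 'WikimediaDesktop' | 'VisualEditor' | 'ActionParse'

interface EndpointCandidate {
  kind: EndpointKind
  // Builds the URL expected to return the HTML of one article.
  articleUrl(title: string): string
}

// Lists the candidates in order of preference (mobile first).
// The Action API path defaults to 'w/api.php', which is an assumption:
// it depends on how the wiki is configured.
function buildCandidates(baseUrl: string, actionApiPath = 'w/api.php'): EndpointCandidate[] {
  const root = baseUrl.replace(/\/+$/, '') // drop trailing slashes
  const api = `${root}/${actionApiPath}`
  const enc = encodeURIComponent
  return [
    { kind: 'WikimediaMobile', articleUrl: (t) => `${root}/api/rest_v1/page/mobile-html/${enc(t)}` },
    { kind: 'WikimediaDesktop', articleUrl: (t) => `${root}/api/rest_v1/page/html/${enc(t)}` },
    { kind: 'VisualEditor', articleUrl: (t) => `${api}?action=visualeditor&paction=parse&format=json&page=${enc(t)}` },
    { kind: 'ActionParse', articleUrl: (t) => `${api}?action=parse&format=json&prop=text&page=${enc(t)}` },
  ]
}

// Returns the first candidate marked as available, or null when none is.
// A kind missing from the map is treated as unavailable.
function selectEndpoint(
  candidates: EndpointCandidate[],
  availability: Partial<Record<EndpointKind, boolean>>,
): EndpointCandidate | null {
  return candidates.find((c) => availability[c.kind] === true) ?? null
}
```

Keeping the selection separate from the network probing means the preference order can be tested without any HTTP call.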

Currently MWoffliner supports a few of them (see #1357 and the source code for more details). But we will need to deprecate some of them and introduce a few new ones: see #1664 and #1601.

The landscape of Mediawiki API end-points for HTML/CSS/JS is not easy to understand. The other API end-points are more stable and available in all Mediawiki instances (therefore not a problem and not really a topic for this ticket). The only important point to understand is that the HTML/CSS/JS end-points allow retrieving, for each article, the HTML and the associated CSS/JS modules (there is a sophisticated module loader called "ResourceLoader").

We retrieve mobile HTML most of the time, but for a few things (or as fallback) we rely on the Desktop rendering.

In both cases we retrieve the HTML and run transformations on it to make it directly usable in a ZIM file snapshot (so offline).

The problem is that the part in charge of retrieving the HTML and parsing it is quite a mess:

No clear separation between the pieces of code dedicated to specific API end-points
No common interface that would let us use a module dealing with end-point number 1 in the same way as a module dealing with end-point number 2 (although they both provide a way to get HTML)
A generally weak architecture around Mediawiki and Downloader, which makes any modification complicated

We need to take the code and revamp it into a few clearly identifiable classes (#1357 is a very small part of that job). Once done, it should be easy/clean to implement support for additional end-points.
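As an illustration of what "a few clearly identifiable classes" behind a common interface could look like, here is a rough sketch. It is not the design that was eventually merged. The names `ArticleRenderer`, `RestHtmlRenderer`, `ActionParseRenderer` and `chooseRenderer` are hypothetical, and the `available` constructor flag stands in for a real capability check. The payload shape handled by `ActionParseRenderer` (`parse.text['*']`) is my assumption about the default JSON output of `action=parse`.

```typescript
// Hypothetical sketch of a common interface, one class per API end-point.
interface ArticleRenderer {
  readonly name: string
  // True when this renderer's end-point exists on the target wiki.
  isAvailable(): boolean
  // Turns the raw response body of the end-point into article HTML.
  render(rawBody: string): string
}

// REST-style end-points are assumed to return the HTML document directly.
class RestHtmlRenderer implements ArticleRenderer {
  readonly name = 'RestHtml'
  private readonly available: boolean
  constructor(available: boolean) {
    this.available = available
  }
  isAvailable(): boolean {
    return this.available
  }
  render(rawBody: string): string {
    return rawBody
  }
}

// The Action API is assumed to wrap the HTML in a JSON envelope.
class ActionParseRenderer implements ArticleRenderer {
  readonly name = 'ActionParse'
  private readonly available: boolean
  constructor(available: boolean) {
    this.available = available
  }
  isAvailable(): boolean {
    return this.available
  }
  render(rawBody: string): string {
    const html = JSON.parse(rawBody)?.parse?.text?.['*']
    if (typeof html !== 'string') {
      throw new Error('Unexpected action=parse payload')
    }
    return html
  }
}

// Callers only see ArticleRenderer; the first available renderer wins,
// so the order of the array expresses preference and fallback.
function chooseRenderer(renderers: ArticleRenderer[]): ArticleRenderer {
  const found = renderers.find((r) => r.isAvailable())
  if (!found) {
    throw new Error('No available end-point to retrieve article HTML')
  }
  return found
}
```

With this shape, supporting an additional end-point would mean adding one class and one array entry, without touching the callers.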

VadimKovalenkoSNF · 2023-08-01T05:20:07Z

I'm going to leave this diagram here as a reference. This is the very first version of how the MW Action and REST APIs are organized at a high level for mwoffliner. Let me know if you expect any improvements/changes here.

Dban1 · 2023-08-29T13:12:52Z

#1892 (comment)

In the linked comment, I think I have found another possible end-point to query for HTML in the absence of the VisualEditor API.

kelson42 added the enhancement label Apr 12, 2023

kelson42 added this to the 1.14.0 milestone Apr 12, 2023

DonAlexandro mentioned this issue May 11, 2023

Rearchitecture MWOffliner HTML/CSS/JS scraping (part #1) #1839

Merged

DonAlexandro mentioned this issue Jun 13, 2023

New URLs builders for Downloader and Mediawiki classes #1854

Merged

kelson42 modified the milestones: 2.1.0, 2.0.0 Jul 18, 2023

kelson42 assigned DonAlexandro and VadimKovalenkoSNF Jul 28, 2023

kelson42 mentioned this issue Aug 22, 2023

1881/modularization article treatment - Rearchitecture MWOffliner HTML/CSS/JS scraping (part #2) #1886

Merged

VadimKovalenkoSNF mentioned this issue Aug 26, 2023

Unable to find appropriate API end-point to retrieve article HTML #1892

Closed

kelson42 modified the milestones: 2.0.0, 1.14.0 Aug 28, 2023

kelson42 closed this as completed in #1886 Sep 7, 2023

DonAlexandro removed their assignment Sep 7, 2023
