Images from page/mobile-html endpoint are too big #1925

Open
VadimKovalenkoSNF opened this issue Oct 10, 2023 · 6 comments

@VadimKovalenkoSNF
Collaborator

The WikimediaMobile API in #1903 relies on the page/mobile-html endpoint when scraping Wikipedia articles. Most of the images that come from mobile-html are 640px in width which is not appropriate for the scrape process, because it drastically increases the size of the final ZIM file. See this article's images as an example: https://bm.wikipedia.org/api/rest_v1/page/mobile-html/Bamak%C9%94
Pay attention to the width value in the src attribute of each image, e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Bamako_bridge2.jpg/640px-Bamako_bridge2.jpg has /640px placed there by the MediaWiki mobileapps service.
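
To make the width check concrete, here is a minimal sketch that reads the width encoded in a thumbnail URL and lists the URLs above a chosen limit. The helper names `thumb_width` and `oversized` are hypothetical, and the URL pattern (`/thumb/.../<width>px-<name>`) is inferred only from the example URL above, so treat it as an assumption rather than a documented contract.

```python
import re

# Assumed pattern, inferred from the example URL: ".../thumb/<path>/<width>px-<name>".
_THUMB_WIDTH = re.compile(r"/thumb/.+/(\d+)px-[^/]+$")


def thumb_width(url):
    """Return the width encoded in a thumbnail URL, or None if there is none."""
    match = _THUMB_WIDTH.search(url)
    return int(match.group(1)) if match else None


def oversized(urls, limit):
    """Return the URLs whose encoded thumbnail width is greater than limit."""
    return [url for url in urls if (thumb_width(url) or 0) > limit]
```

For the example above, `thumb_width` applied to the `640px-Bamako_bridge2.jpg` URL yields 640, while a URL that points directly at an original file (no `/thumb/` segment) yields `None`.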

Related ticket in Phabricator: https://phabricator.wikimedia.org/T348529

@VadimKovalenkoSNF VadimKovalenkoSNF changed the title Images from page/mobile-html endpoind are to big Images from page/mobile-html endpoint are to big Oct 10, 2023
@kelson42 kelson42 added this to the 1.15.0 milestone Oct 13, 2023
@kelson42
Collaborator

kelson42 commented Oct 13, 2023

Most of the images that come from mobile-html are 640px in width which is not appropriate for the scrape process,

No, it's "not appropriate for a mobile endpoint", AFAIK. See https://www.browserstack.com/guide/ideal-screen-sizes-for-responsive-design#toc2. We should report this as a bug, not as a feature request.

@kelson42 kelson42 changed the title Images from page/mobile-html endpoint are to big Images from page/mobile-html endpoint are too big Oct 13, 2023
@kelson42
Collaborator

Bug report done at https://phabricator.wikimedia.org/T349972

@kelson42 kelson42 self-assigned this Oct 29, 2023
@cscott
Contributor

cscott commented Dec 19, 2023

Be careful that you're looking at the raw HTML served to the browser, not the HTML as modified by the lazy-loading mechanism. Fetching that HTML gives:

<td align="center" colspan="2" style="background:#f9f9f9;"><span class="mw-default-size"><a href="./Fichier:Mali-Bamako.png" class="mw-file-description" title="Bamako Mali kɔnɔ"><span class="mw-file-element pcs-lazy-load-placeholder pcs-lazy-load-placeholder-pending" style="width: 200px;" data-class="mw-file-element" data-src="//upload.wikimedia.org/wikipedia/commons/4/46/Mali-Bamako.png" data-width="200" data-height="187" data-alt="Bamako Mali kɔnɔ" data-data-file-width="200" data-data-file-height="187"><span style="padding-top: 93.5%;"></span></span></a></span></td></tr>
</tbody></table>

<figure class="pcs-widen-image-ancestor"><a href="./Fichier:Bamako_et_fleuve_Niger.jpg" class="mw-file-description pcs-widen-image-ancestor"><span class="mw-file-element pcs-widen-image-override pcs-lazy-load-placeholder pcs-lazy-load-placeholder-pending" style="width: 320px;" data-class="mw-file-element pcs-widen-image-override" data-src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Bamako_et_fleuve_Niger.jpg/320px-Bamako_et_fleuve_Niger.jpg" data-srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Bamako_et_fleuve_Niger.jpg/480px-Bamako_et_fleuve_Niger.jpg 1.5x" data-width="320" data-height="241" data-data-file-width="600" data-data-file-height="450"><span style="padding-top: 75.3125%;"></span></span></a><figcaption>Bamako</figcaption></figure>

There's no actual <img> tag there, it's all lazy loaded. Given that you are already presumably processing these lazy-load attributes in order to fetch the underlying resource, you can substitute any size preference you like, right?
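
As an illustration of reading those lazy-load attributes, the following sketch collects the `data-src`, `data-width` and `data-data-file-width` values from placeholder spans using only Python's standard `html.parser`. The class name `pcs-lazy-load-placeholder` and the attribute names come from the excerpt above; the names `PlaceholderCollector` and `collect_placeholders` are hypothetical, and this is not how mwoffliner itself is implemented.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Convert a numeric attribute string to int, or return None."""
    return int(value) if value and value.isdigit() else None


class PlaceholderCollector(HTMLParser):
    """Collect the data-* attributes of lazy-load placeholder <span> elements."""

    def __init__(self):
        super().__init__()
        self.placeholders = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # Only spans carrying the placeholder class describe a deferred image.
        if tag == "span" and "pcs-lazy-load-placeholder" in classes:
            self.placeholders.append({
                "src": attributes.get("data-src"),
                "width": _to_int(attributes.get("data-width")),
                "file_width": _to_int(attributes.get("data-data-file-width")),
            })


def collect_placeholders(html):
    """Return one dict per lazy-load placeholder found in the HTML string."""
    parser = PlaceholderCollector()
    parser.feed(html)
    parser.close()
    return parser.placeholders
```

Each returned dict carries the requested display width next to the width of the original file, which is the information needed to pick a different thumbnail size.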

The more fundamental question is whether or not kiwix wants to be fetching the "mobile HTML" in the first place, as opposed to the HTML used for desktop browsing.

@kelson42
Collaborator

kelson42 commented Dec 19, 2023

Be careful that you're looking at the raw HTML served to the browser, not the HTML as modified by the lazy-loading mechanism. Fetching that HTML gives:

Yes, it is really a pain to have a public API that does not deliver proper HTML! But:

  • This is not the problem I reported
  • We have always had to make this kind of transformation around thumbnails, and I guess we will have to live with it and handle the transformation ourselves...

There's no actual <img> tag there, it's all lazy loaded. Given that you are already presumably processing these lazy-load attributes in order to fetch the underlying resource, you can substitute any size preference you like, right?

Not sure what I should answer here... basically you ask us to (for each picture):

  • Discover on our own the size (width) which is specified in the wiki code
  • Rebuild the upstream picture URL (and download the picture)
  • Modify the whole thumbnail HTML/CSS code to get it right

Do I get that right? Because that sounds very difficult and error-prone to do... In particular, why could Wikimedia not generate proper HTML (respecting both the given wiki code AND the mobile constraints) in the first place? More or less like before?
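
For reference, the three per-picture steps listed above could look roughly like the sketch below. It rests on several assumptions: that `data-width` reflects the width requested in the wiki code, that the thumbnail server accepts the rewritten width, and that only URLs containing `/thumb/` encode a width. The function name `placeholder_to_img` and the `max_width` parameter are hypothetical.

```python
import re
from html import escape


def placeholder_to_img(src, width, height, file_width, max_width):
    """Build an <img> tag for one placeholder, capping its width at max_width."""
    # Step 1: never ask for more pixels than the wiki code, the file, or the cap allow.
    target = min(width, file_width, max_width)
    # Scale the height by the same factor to keep the aspect ratio.
    target_height = round(height * target / width)
    # Step 2: rewrite the encoded width; URLs of original files are left untouched
    # (assumption: only /thumb/ URLs carry a "<width>px-" prefix).
    if "/thumb/" in src:
        src = re.sub(
            r"/\d+px-([^/]+)$",
            lambda m: f"/{target}px-{m.group(1)}",
            src,
        )
    # Step 3: emit a plain <img> in place of the placeholder span.
    return f'<img src="{escape(src)}" width="{target}" height="{target_height}">'
```

Called with the values of the second placeholder quoted earlier (`width=320`, `height=241`, `file_width=600`) and `max_width=160`, it requests the `160px-` thumbnail instead of the `320px-` one. The surrounding markup, such as the `<figure>` wrapper and any inline `style="width: ...;"`, would still need adjusting, which this sketch does not attempt.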

The more fundamental question is whether or not kiwix wants to be fetching the "mobile HTML" in the first place, as opposed to the HTML used for desktop browsing.

We want to serve our users properly... and the world goes mobile... Wikimedia goes mobile... Why should Kiwix do differently?

@cscott
Contributor

cscott commented Dec 21, 2023

Well, in this case it's because the "mobile HTML" is optimized for an entirely different use case than yours -- one where we want to save bandwidth by deferring the loading of images as long as possible.

All the things you are asking for are present in the standard ("non-mobile") Parsoid HTML.

@kelson42
Collaborator

kelson42 commented Dec 21, 2023

@cscott The lazy loading is a non-topic here. Can we focus on the bad sizes of the images, which IS the problem?

@kelson42 kelson42 modified the milestones: 1.15.0, 1.14.0 May 7, 2024