Assets inside `<noscript>` tags are not fetched when running a zimit scrape #121

Jaifroid · 2022-05-07T15:48:18Z

I was testing the Docker image of openzim/zimit on a Wikivoyage page (one that happens to be missing from our Wikivoyage ZIM, but that is immaterial). Due to the image lazy-loading script employed by Wikimedia web sites, all images other than the first one on a page are enclosed in <noscript>...</noscript> tags. Furthermore, there is a placeholder <span> tag used to reserve space for the image.

The Zimit scraper implemented here does not scrape assets enclosed in the <noscript> tags. And if the lazy load script runs inside the ZIM, it attempts to load the image from the online server instead of from the ZIM (but the asset is simply not in the ZIM in any case).
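To make the markup pattern concrete, here is a minimal sketch of how one could list the image URLs that only appear inside <noscript> blocks, which is a quick way to check them against the ZIM contents. It uses Python's standard-library HTML parser and is not how Zimit or its crawler works internally. The sample markup, the `placeholder` class name and the `find_noscript_images` helper are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class NoscriptImageFinder(HTMLParser):
    """Collect the src of every <img> nested inside a <noscript> element."""

    def __init__(self):
        super().__init__()
        self.noscript_depth = 0  # > 0 while we are inside <noscript>
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.noscript_depth += 1
        elif tag == "img" and self.noscript_depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth > 0:
            self.noscript_depth -= 1


def find_noscript_images(html):
    """Return the src values of images found only inside <noscript>."""
    finder = NoscriptImageFinder()
    finder.feed(html)
    finder.close()
    return finder.sources


# Hypothetical markup imitating the lazy-loading pattern described above:
# the first image is a plain <img>, later ones sit inside <noscript>
# next to a placeholder <span>.
SAMPLE_HTML = """
<img src="/images/first.jpg">
<noscript><img src="/images/second.jpg" width="320"></noscript>
<span class="placeholder" data-src="/images/second.jpg"></span>
"""

print(find_noscript_images(SAMPLE_HTML))
```

Each URL returned this way can then be looked up in the ZIM to see whether the asset was actually captured.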

I think it's immaterial that we have proper ZIMs for Wikimedia sites, as many other sites use these lazy loading techniques, and they seem to defeat the scraper used by Zimit.

Limitation: I ran the scrape locally with a limit of 100 pages, but I don't think that affects the pulling of assets from a particular page.

Jaifroid · 2022-05-07T16:17:04Z

On further testing, I'm not sure this is 100% true. Some of these images appear to have been scraped, but their src as given in the ZIM does not appear to correspond to the file's title (after transforming absolute references to ZIM URLs). This needs further investigation.

Jaifroid · 2022-05-07T17:39:39Z

I've determined that this issue is still valid. The scraped ZIM contains very similar filenames that are scraped from the Wikimedia Commons File Entry for the image (since each image in Wikimedia ZIMs is hyperlinked to a File:[name] entry on Wikimedia Commons). However, those filenames correspond to different sizes of the image. The images themselves that are inside <noscript> blocks are not scraped UNLESS there is a corresponding image size entry on the Wikimedia Commons File:[name] page.
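As a rough way to tell "same image, different size" apart from a genuinely different file, the following sketch compares two URLs after stripping a width prefix from the filename. It assumes sized renditions are named `<width>px-<name>`; the URLs and the helper names `base_image_name` and `same_image` are hypothetical, and query strings are not handled.

```python
import re
from urllib.parse import unquote

# Assumed naming scheme for sized renditions: "<width>px-<original name>".
SIZE_PREFIX = re.compile(r"^\d+px-")


def base_image_name(url):
    """Return the last path segment, percent-decoded, without a size prefix."""
    filename = unquote(url.rsplit("/", 1)[-1])
    return SIZE_PREFIX.sub("", filename)


def same_image(url_a, url_b):
    """True when both URLs appear to be renditions of the same image."""
    return base_image_name(url_a) == base_image_name(url_b)


# Hypothetical URLs: one as referenced in a <noscript> block, one as
# listed on the corresponding File: page.
in_noscript = "https://example.org/thumb/a/ab/Kings_College.jpg/320px-Kings_College.jpg"
on_file_page = "https://example.org/thumb/a/ab/Kings_College.jpg/640px-Kings_College.jpg"

print(same_image(in_noscript, on_file_page))
```

A match here only says the two URLs point at renditions of one image; the exact rendition referenced by the page can still be absent from the ZIM.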

Jaifroid · 2022-05-07T17:40:34Z

A further complication arises with the way filenames are encoded, but that is a separate issue.

kelson42 · 2022-05-07T20:04:43Z

@Jaifroid Do you have an exact command to run as example?

Jaifroid · 2022-05-08T07:39:34Z

Yes, I used the following (output path needs tweaking for user's context):

docker run -v C:\Users\jaifroid\Source\Repos\zimit\:/output -w /output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit zimit --url https://en.m.wikivoyage.org/wiki/Cambridge --name myzimfile --workers 2 --limit 100 --scroll

Only the first two main images of the landing page are in the ZIM. The ZIM contains closely related (but inexactly named) image files for (some of) the missing images because it has scraped the corresponding Wikimedia Commons 'File:...' page for each image.

I think the zimit --scroll is designed to overcome precisely this issue, but it doesn't seem to work. The documentation on its use is ambiguous. Running zimit --help we get:

--scroll If set, will autoscroll to bottom of the page

But the readme on GitHub says:

--scroll [N] - if set, will activate a simple auto-scroll behavior on each page to scroll for upto N seconds

Unfortunately, passing a number as N (either as --scroll 5 or as --scroll [5], though I presume the brackets just mean the value is optional) results in a warc2zim error FileNotFoundError: [Errno 2] No such file or directory: '5'. Clearly zimit is erroneously passing the optional value to warc2zim, which doesn't know what to do with it.
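One plausible mechanism for this error, not confirmed against the zimit source, is that `--scroll` is declared as a plain boolean flag while unrecognised arguments are forwarded to the next tool. The sketch below reproduces that behaviour with Python's `argparse` and shows how declaring the value as optional would consume it instead. The two parser functions are hypothetical and only the `--scroll` name comes from the issue.

```python
import argparse


def parse_flag_only(argv):
    """--scroll declared as a boolean switch: a following value is left over."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scroll", action="store_true")
    # parse_known_args returns (namespace, list_of_unrecognised_arguments)
    return parser.parse_known_args(argv)


def parse_optional_value(argv):
    """--scroll [N]: the value is optional and consumed when present."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scroll", nargs="?", type=int, const=0, default=None)
    return parser.parse_known_args(argv)


known, leftover = parse_flag_only(["--scroll", "5"])
print(known.scroll, leftover)  # the stray "5" would be passed downstream

known, leftover = parse_optional_value(["--scroll", "5"])
print(known.scroll, leftover)  # nothing is left to forward
```

In the first case the leftover `'5'` would reach the downstream command as a positional argument, which would be consistent with it being treated as a file path.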

rgaudin · 2022-05-08T11:29:35Z

zimit is not using the latest version of the crawler. Maybe this option changed?
We're waiting for a new pylibzim release to update zimit to latest crawler and replayer.

stale · 2022-07-10T14:08:04Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

rgaudin · 2022-08-08T07:52:06Z

--scroll option has been removed.

Jaifroid · 2022-08-08T08:05:22Z

So I guess we need to test again whether the issue persists.

kelson42 · 2023-04-24T09:12:37Z

@Jaifroid Do you still have this issue with the latest 1.3.1?

Jaifroid · 2023-04-24T09:33:44Z

I'll test and report back.

Jaifroid · 2024-01-02T10:30:30Z

As I haven't seen more instances of this, I'll close this issue. I think lazy loading is properly handled nowadays, though I never re-tested the specific case, and probably won't have time to.

stale bot added the stale label Jul 10, 2022

rgaudin closed this as completed Aug 8, 2022

rgaudin reopened this Aug 8, 2022

stale bot removed the stale label Aug 8, 2022

kelson42 added bug question labels Apr 24, 2023

kelson42 added this to the 2.0.0 milestone Apr 24, 2023

kelson42 assigned Jaifroid Apr 24, 2023

Jaifroid closed this as completed Jan 2, 2024
