Assets inside <noscript> tags are not fetched when running a zimit scrape #121

Closed
Jaifroid opened this issue May 7, 2022 · 12 comments

Jaifroid commented May 7, 2022

I was testing the Docker image of openzim/zimit on a Wikivoyage page (one that happens to be missing from our Wikivoyage ZIM, but that is immaterial). Due to the image lazy-loading script employed by Wikimedia web sites, all images other than the first one on a page are enclosed in <noscript>...</noscript> tags. Furthermore, there is a placeholder <span> tag used to reserve space for the image.

The Zimit scraper implemented here does not scrape assets enclosed in the <noscript> tags. And if the lazy load script runs inside the ZIM, it attempts to load the image from the online server instead of from the ZIM (but the asset is simply not in the ZIM in any case).
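
To make the affected assets easier to spot, here is a minimal sketch that lists the images a page only exposes inside <noscript> blocks. It uses Python's standard html.parser and is not part of zimit or its crawler. The class name, the helper function and the sample markup are all assumptions made up for illustration, modelled loosely on the pattern described above.

```python
from html.parser import HTMLParser


class NoscriptImageFinder(HTMLParser):
    """Collect the src of every <img> that sits inside a <noscript> block."""

    def __init__(self):
        super().__init__()
        self.noscript_depth = 0
        self.noscript_images = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.noscript_depth += 1
        elif tag == "img" and self.noscript_depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.noscript_images.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth > 0:
            self.noscript_depth -= 1


def find_noscript_images(html):
    """Return the image URLs that appear only inside <noscript> tags."""
    finder = NoscriptImageFinder()
    finder.feed(html)
    finder.close()
    return finder.noscript_images


# Hypothetical markup: one eagerly loaded image, one lazy-loaded image
# wrapped in <noscript>, and a placeholder <span> reserving its space.
SAMPLE_HTML = """
<img src="/images/first.jpg">
<noscript><img src="/images/second.jpg"></noscript>
<span class="placeholder" data-src="/images/second.jpg"></span>
"""

print(find_noscript_images(SAMPLE_HTML))
```

Comparing such a list against the entries actually present in the ZIM would show which images were skipped.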

I think it's immaterial that we have proper ZIMs for Wikimedia sites, as many other sites use these lazy loading techniques, and they seem to defeat the scraper used by Zimit.

Limitation: I ran the scrape locally with a limit of 100 pages, but I don't think that affects the pulling of assets from a particular page.

Author

Jaifroid commented May 7, 2022

On further testing, I'm not sure this is 100% true. Some of these images appear to have been scraped, but their src as given in the ZIM does not appear to correspond to the file's title (after transforming absolute references to ZIM URLs). This needs further investigation.

Author

Jaifroid commented May 7, 2022

I've determined that this issue is still valid. The scraped ZIM contains files with very similar names, which are scraped from the Wikimedia Commons File entry for the image (since each image in Wikimedia ZIMs is hyperlinked to a File:[name] entry on Wikimedia Commons). But those files are different sizes of the image. The images themselves that are inside <noscript> blocks are not scraped UNLESS there is a corresponding image size entry on the Wikimedia Commons File:[name] page.
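
The size mismatch can be illustrated with a rough sketch. It assumes that the last path segment of a thumbnail URL has the form "<width>px-<file name>"; that naming pattern, the function names and the example.org URLs are assumptions for illustration, not something taken from zimit or from the scraped ZIM.

```python
import re
from collections import defaultdict
from urllib.parse import unquote, urlsplit

# Assumed thumbnail naming: the last path segment is "<width>px-<file name>".
THUMB_RE = re.compile(r"^(?P<width>\d+)px-(?P<name>.+)$")


def split_thumbnail(url):
    """Return (file name, width) for a thumbnail URL, or None if it does not match."""
    last_segment = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    match = THUMB_RE.match(last_segment)
    if match is None:
        return None
    return match.group("name"), int(match.group("width"))


def missing_sizes(wanted_urls, archived_urls):
    """Map each file name to the widths the page wants but the archive lacks."""
    archived = defaultdict(set)
    for url in archived_urls:
        parsed = split_thumbnail(url)
        if parsed:
            archived[parsed[0]].add(parsed[1])

    missing = defaultdict(set)
    for url in wanted_urls:
        parsed = split_thumbnail(url)
        if parsed and parsed[1] not in archived[parsed[0]]:
            missing[parsed[0]].add(parsed[1])
    return dict(missing)


# Hypothetical URLs: the page wants a 320px thumbnail, the archive holds 800px.
wanted = [
    "https://example.org/thumb/a/ab/Bridge.jpg/320px-Bridge.jpg",
    "https://example.org/thumb/c/cd/Punt.jpg/640px-Punt.jpg",
]
archived = [
    "https://example.org/thumb/a/ab/Bridge.jpg/800px-Bridge.jpg",
    "https://example.org/thumb/c/cd/Punt.jpg/640px-Punt.jpg",
]

print(missing_sizes(wanted, archived))
```

In this made-up case only the 320px version of Bridge.jpg is reported as missing, which matches the pattern above: a file with a near-identical name is present, but not at the size the page requests.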

Author

Jaifroid commented May 7, 2022

A further complication arises with the way filenames are encoded, but that is a separate issue.

Contributor

kelson42 commented May 7, 2022

@Jaifroid Do you have an exact command to run as example?

Author

Jaifroid commented May 8, 2022

Yes, I used the following (output path needs tweaking for user's context):

docker run -v C:\Users\jaifroid\Source\Repos\zimit\:/output -w /output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit zimit --url https://en.m.wikivoyage.org/wiki/Cambridge --name myzimfile --workers 2 --limit 100 --scroll

Only the first two main images of the landing page are in the ZIM. The ZIM contains closely related (but inexactly named) image files for (some of) the missing images because it has scraped the corresponding Wikimedia Commons 'File:...' page for each image.

I think the zimit --scroll is designed to overcome precisely this issue, but it doesn't seem to work. The documentation on its use is ambiguous. Running zimit --help we get:

--scroll If set, will autoscroll to bottom of the page

But the readme on GitHub says:

--scroll [N] - if set, will activate a simple auto-scroll behavior on each page to scroll for upto N seconds

Unfortunately, passing a number as N (either as --scroll 5 or as --scroll [5], though I presume the brackets just mean the value is optional) results in a warc2zim error FileNotFoundError: [Errno 2] No such file or directory: '5'. Clearly zimit is erroneously passing the optional value to warc2zim, which doesn't know what to do with it.
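
I have not checked how zimit actually parses its arguments, so the following is only a sketch of how this kind of leak can happen with Python's argparse. It assumes, hypothetically, that --scroll is declared as a plain on/off flag and that unrecognised arguments are collected with parse_known_args and forwarded to another tool. The two parser-building functions are made up for illustration.

```python
import argparse


def build_flag_parser():
    """--scroll as a plain on/off flag, as the --help text suggests."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--url")
    parser.add_argument("--scroll", action="store_true")
    return parser


def build_optional_value_parser():
    """--scroll [N]: an optional integer value, as the readme suggests."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--url")
    # const is used when --scroll is given without a value.
    parser.add_argument("--scroll", nargs="?", type=int, const=0, default=None)
    return parser


argv = ["--url", "https://example.org/", "--scroll", "5"]

# With a store_true flag, "5" is not consumed and ends up in the extras,
# which a wrapper might then hand on to a downstream command.
flag_args, flag_extras = build_flag_parser().parse_known_args(argv)

# With nargs="?", the same "5" is consumed as the value of --scroll.
value_args, value_extras = build_optional_value_parser().parse_known_args(argv)

print(flag_args.scroll, flag_extras)
print(value_args.scroll, value_extras)
```

With the flag-style declaration the stray "5" is left over as an unparsed argument, which would be consistent with a downstream tool treating it as a file path. With the optional-value declaration nothing is left over.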

Member

rgaudin commented May 8, 2022

zimit is not using the latest version of the crawler. Maybe this option changed?
We're waiting for a new pylibzim release to update zimit to the latest crawler and replayer.


stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jul 10, 2022
Member

rgaudin commented Aug 8, 2022

The --scroll option has been removed.

@rgaudin rgaudin closed this as completed Aug 8, 2022
@rgaudin rgaudin reopened this Aug 8, 2022
@stale stale bot removed the stale label Aug 8, 2022
Author

Jaifroid commented Aug 8, 2022

So I guess we need to test again whether the issue persists.

@kelson42 kelson42 added this to the 2.0.0 milestone Apr 24, 2023
@kelson42
Contributor

@Jaifroid Do you still have this issue with the latest version, 1.3.1?

@Jaifroid
Author

I'll test and report back.

Author

Jaifroid commented Jan 2, 2024

As I haven't seen more instances of this, I'll close this issue. I think lazy loading is properly handled nowadays, though I never re-tested the specific case, and probably won't have time to.

@Jaifroid Jaifroid closed this as completed Jan 2, 2024