Assets inside `<noscript>` tags are not fetched when running a zimit scrape #121
On further testing, I'm not sure this is 100% true. Some of these images appear to have been scraped, but their …
I've determined that this issue is still valid. The scraped ZIM contains very similar filenames that are scraped from the Wikimedia Commons File entry for the image (since each image in Wikimedia ZIMs is hyperlinked to a `File:[name]` entry on Wikimedia Commons). But those filenames correspond to different sizes of the image. The images themselves that are inside …
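To illustrate the "different sizes of the image" observation: Wikimedia serves resized thumbnails from URLs that embed the original filename under a `thumb/` path segment, which is why the ZIM can contain closely related but inexactly named files. The sketch below (not zimit code; the example URL is illustrative) maps a thumbnail URL back to the original asset name for comparison:

```python
def original_from_thumb(url: str) -> str:
    """Map an upload.wikimedia.org .../thumb/<h1>/<h2>/<name>/<width>px-<name>
    thumbnail URL back to the original file URL; return other URLs unchanged."""
    parts = url.split("/")
    if "thumb" in parts:
        i = parts.index("thumb")
        # Drop the "thumb" segment and the trailing "<width>px-<name>" segment.
        return "/".join(parts[:i] + parts[i + 1:-1])
    return url

print(original_from_thumb(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/320px-Example.jpg"
))
# https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg
```

Two entries in the ZIM that normalize to the same original URL are the same image at different widths, which matches what was observed in the scrape.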
A further complication arises with the way filenames are encoded, but that is a separate issue.
@Jaifroid Do you have an exact command to run as an example?
Yes, I used the following (output path needs tweaking for user's context):
Only the first two main images of the landing page are in the ZIM. The ZIM contains closely related (but inexactly named) image files for (some of) the missing images because it has scraped the corresponding Wikimedia Commons 'File:...' page for each image. I think the zimit …
But the readme on GitHub says:
Unfortunately, passing a number as N (either as …
zimit is not using the latest version of the crawler. Maybe this option changed?
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
So I guess we need to test again whether the issue persists. |
@Jaifroid Do you still have this issue with the latest …
I'll test and report back. |
As I haven't seen more instances of this, I'll close this issue. I think lazy loading is properly handled nowadays, though I never re-tested the specific case, and probably won't have time to.
I was testing the Docker image of openzim/zimit on a Wikivoyage page (one that happens to be missing from our Wikivoyage ZIM, but that is immaterial). Due to the image lazy-loading script employed by Wikimedia websites, all images other than the first one on a page are enclosed in `<noscript>...</noscript>` tags. Furthermore, there is a placeholder `<span>` tag used to reserve space for the image.

The Zimit scraper implemented here does not scrape assets enclosed in the `<noscript>` tags. And if the lazy-load script runs inside the ZIM, it attempts to load the image from the online server instead of from the ZIM (but the asset is simply not in the ZIM in any case).

I think it's immaterial that we have proper ZIMs for Wikimedia sites, as many other sites use these lazy-loading techniques, and they seem to defeat the scraper used by Zimit.
Limitation: I ran the scrape locally with a limit of 100 pages, but I don't think that affects the pulling of assets from a particular page.
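The markup pattern described above (a placeholder `<span>` plus the real `<img>` hidden inside `<noscript>`) is easy to reproduce. As a minimal sketch of what a crawler would need to do, and not how zimit/browsertrix actually handles it, the stdlib-only parser below collects `src` URLs of images that only exist inside `<noscript>` blocks (the sample HTML and URL are made up):

```python
from html.parser import HTMLParser

class NoscriptImageExtractor(HTMLParser):
    """Collect src attributes of <img> tags that appear inside <noscript>."""

    def __init__(self):
        super().__init__()
        self.in_noscript = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.in_noscript = True
        elif tag == "img" and self.in_noscript:
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript":
            self.in_noscript = False

# Hypothetical lazy-loading markup of the kind described in this issue.
html = (
    '<span class="lazy-image-placeholder"></span>'
    '<noscript><img src="//upload.wikimedia.org/x/Foo.jpg" alt="Foo"></noscript>'
)
extractor = NoscriptImageExtractor()
extractor.feed(html)
print(extractor.urls)  # ['//upload.wikimedia.org/x/Foo.jpg']
```

A scraper that only follows assets referenced by "live" DOM nodes would never see these URLs, which is consistent with the missing images reported here.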