Images missing even when under cutoff value #70

Closed
Popolechien opened this issue Dec 17, 2020 · 5 comments

@Popolechien
Contributor

I just zimmed up a WordPress blog with 186 articles (cutoff at 1,000) and about 500 images (https://mesquartierschinois.wordpress.com). It's a standard, free WordPress site, i.e. no funky extensions added.

I would say 10-20% of images are still missing.

@kelson42 kelson42 pinned this issue Dec 17, 2020
@kelson42
Contributor

I believe I can confirm the problem: with the latest wiki.openzim.org scrape, I also had a few images missing.

@rgaudin
Member

rgaudin commented Dec 17, 2020

Please share the ZIM and indications of where to find such images so we can look at what's special about them. Please also include the task ID or link so we can check the logs.

@Popolechien
Contributor Author

https://farm.youzim.it/pipeline/d1c2f201514f3da67f887df5 for the task - images are missing on every second page or so.

@rgaudin
Member

rgaudin commented Jan 13, 2021

OK, I've looked into this. It is also related to images' srcset, but it's not fixed by #63 (which added support for them).

What we're seeing here is that the crawler is not making requests for all of the images in the srcset (or those requests fail to complete), so some images are missing. Depending on which one your browser picks (kinda hard to predict, but you can inspect what it's trying to display), you may get one that was crawled or one that was not.

I've also found that the crawler is sort of random in selecting which image gets crawled…
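To check a given page, something like the following could list which srcset candidates were never crawled. This is only a rough sketch, not code from zimit or browsertrix-crawler: the function and class names are hypothetical, the example.org URLs are made up, and the srcset parsing is simplified (it assumes candidates are separated by a comma followed by whitespace, as in typical WordPress markup, rather than following the full HTML parsing rules).

```python
import re
from html.parser import HTMLParser


def parse_srcset(srcset):
    """Split a srcset value into (url, descriptor) pairs.

    Simplified: assumes candidates are separated by a comma followed by
    whitespace; this is not the full HTML srcset parsing algorithm.
    """
    candidates = []
    for chunk in re.split(r",\s+", srcset.strip()):
        parts = chunk.strip().rstrip(",").split()
        if not parts:
            continue
        url = parts[0]
        descriptor = parts[1] if len(parts) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


class SrcsetCollector(HTMLParser):
    """Collect every URL an <img> tag could make a browser request."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        if attrs.get("src"):
            self.urls.append(attrs["src"])
        for url, _descriptor in parse_srcset(attrs.get("srcset") or ""):
            self.urls.append(url)


def missing_candidates(html, crawled_urls):
    """Return image URLs referenced by the page but absent from the crawl."""
    collector = SrcsetCollector()
    collector.feed(html)
    return [url for url in collector.urls if url not in crawled_urls]


# Hypothetical page fragment and crawl result, for illustration only.
page = (
    '<img src="https://example.org/a.jpg" '
    'srcset="https://example.org/a.jpg?w=300 300w, '
    'https://example.org/a.jpg?w=600 600w">'
)
crawled = {"https://example.org/a.jpg", "https://example.org/a.jpg?w=300"}
missing = missing_candidates(page, crawled)
```

If the browser reading the ZIM happens to pick a candidate that appears in `missing`, that would show up as a missing image, which matches the "every second page or so" symptom.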

I've opened a ticket upstream: webrecorder/browsertrix-crawler#3

@kelson42 should we keep that open until it gets solved upstream?

@rgaudin rgaudin closed this as completed Jan 13, 2021
@rgaudin rgaudin self-assigned this Jan 13, 2021
@rgaudin rgaudin reopened this Jan 13, 2021
@kelson42
Contributor

@rgaudin Yes, please keep this ticket open.

@rgaudin rgaudin unpinned this issue Mar 4, 2021