New request: womenshistory.si.edu #1121

Open
benoit74 opened this issue Jul 24, 2024 · 9 comments
Labels: Bug, Upstream, Zimit

Comments

@benoit74
Contributor

This is a "fake" new request to track the fact that we (@kelson42 and I) are building a demo Zimit ZIM of womenshistory.si.edu. It will also record the choices we make in the recipe configuration.

  • Website URL: https://womenshistory.si.edu
  • License: Non-commercial use is allowed provided we cite the source or link to the website, which we will do one way or another: https://www.si.edu/Termsofuse
  • Desired ZIM Title: American Women’s History
  • Desired ZIM Description: Bringing American Women’s History into Focus
  • Desired ZIM Icon –png (URL or attach one): website favicon is perfect
  • Language (ISO 639-3): eng
  • Is this a MediaWiki?: no
@benoit74
Contributor Author

benoit74 commented Jul 24, 2024

Recipe URL : https://farm.openzim.org/recipes/womenshistory.si.edu

Exclude regex so far: https?:\/\/womenshistory\.si\.edu(?:\/es\/|\/contact-us|\/search|\/object|.*\?.*edan_fq)

The goal is to exclude:

  • the contact-us page, which is useless offline
  • search pages, which are not going to work
  • object pages, which are far too numerous, at least for now
  • es (Spanish) pages, which we do not want to include and which mostly do not work on the website (yet, at least)
  • any URL containing edan_fq: this is a trick to remove facet search pages, e.g. on https://womenshistory.si.edu/exhibitions ; I am not sure it is future-proof, but I have not found a better solution (yet)
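
A quick way to sanity-check the exclude pattern is to run it against a handful of URLs. The sketch below copies the regex verbatim from this comment; the sample paths and query values (other than /contact-us and /exhibitions) are hypothetical, and it assumes the crawler excludes a URL when the pattern matches anywhere in it, which is why it uses `re.search`.

```python
import re

# Exclude pattern copied verbatim from the recipe comment.
EXCLUDE = re.compile(
    r"https?:\/\/womenshistory\.si\.edu"
    r"(?:\/es\/|\/contact-us|\/search|\/object|.*\?.*edan_fq)"
)


def is_excluded(url: str) -> bool:
    """Return True when the exclude pattern matches the URL.

    Assumption: a match anywhere in the URL is enough to exclude it,
    so re.search is used rather than re.fullmatch.
    """
    return EXCLUDE.search(url) is not None


# Hypothetical sample URLs, for illustration only.
SAMPLES = [
    "https://womenshistory.si.edu/contact-us",
    "https://womenshistory.si.edu/search?keys=example",
    "https://womenshistory.si.edu/object/example",
    "https://womenshistory.si.edu/es/",
    "https://womenshistory.si.edu/exhibitions?edan_fq=example",
    "https://womenshistory.si.edu/exhibitions",
    "https://womenshistory.si.edu/blog/example-post",
]

for url in SAMPLES:
    print("exclude" if is_excluded(url) else "keep   ", url)
```

Two properties of the pattern as written are worth knowing: `\/es\/` requires the trailing slash, so a bare `/es` URL is not excluded, and `\/search` and `\/object` are prefix matches, so any path on the site starting with those strings is excluded as well.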

@benoit74
Contributor Author

The ZIM seems pretty OK on the dev library: https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07

The only significant concern I've found so far is that YouTube videos are missing, see e.g. https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07/womenshistory.si.edu/blog/gold-standard-how-these-iconic-olympic-athletes-inspired-and-united-us. It looks like they have not even been fetched by the crawler, but even if they had been, I think playback would not work, because they are embedded through a special URL like https://womenshistory.si.edu/media/oembed?url=https%3A//www.youtube.com/watch%3Fv%3D6l7OxP67XSc&max_width=1280&max_height=720&hash=ZPHuxNt5R3L87vqLbN-Ub0XypraFbUX0cASUJv_mTjg which itself embeds the YouTube player iframe. This looks like a specially crafted backend URL; we could probably rewrite it directly to the YouTube fuzzy replay URL, to be confirmed.
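
To illustrate what such a rewrite would have to start from, here is a minimal sketch that pulls the embedded YouTube video ID out of the oembed URL quoted above. The function name `embedded_youtube_id` is hypothetical, and the sketch deliberately stops at the ID: the exact fuzzy replay URL to rewrite to is not specified here and is left open.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# The oembed URL quoted in this comment, split across lines for readability.
OEMBED_URL = (
    "https://womenshistory.si.edu/media/oembed"
    "?url=https%3A//www.youtube.com/watch%3Fv%3D6l7OxP67XSc"
    "&max_width=1280&max_height=720"
    "&hash=ZPHuxNt5R3L87vqLbN-Ub0XypraFbUX0cASUJv_mTjg"
)


def embedded_youtube_id(oembed_url: str) -> Optional[str]:
    """Return the YouTube video ID wrapped in an oembed URL, or None.

    Hypothetical helper: it only handles the `watch?v=...` form seen in
    the example URL, not other YouTube URL shapes.
    """
    # The wrapped target is percent-encoded in the `url` query parameter;
    # parse_qs decodes it for us.
    targets = parse_qs(urlsplit(oembed_url).query).get("url")
    if not targets:
        return None
    target = urlsplit(targets[0])
    if target.hostname not in {"www.youtube.com", "youtube.com"}:
        return None
    video_ids = parse_qs(target.query).get("v")
    return video_ids[0] if video_ids else None


print(embedded_youtube_id(OEMBED_URL))  # prints 6l7OxP67XSc
```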

@benoit74
Contributor Author

The YouTube problem is now tracked upstream: openzim/zimit#360

benoit74 added the Bug and Upstream labels Jul 24, 2024
@Popolechien
Collaborator

Well technically I see a tiny bit showing up
image

@benoit74
Contributor Author

Well spotted @Popolechien! Looking at the HTML of this page, the upstream issue is most probably openzim/warc2zim#293

@benoit74
Contributor Author

The bug on the "mrs-nixon" page is in fact a bit different, so I've opened a dedicated issue: openzim/warc2zim#364

@benoit74
Contributor Author

The YouTube issue is in fact in warc2zim: openzim/warc2zim#316

@benoit74
Contributor Author

The ZIM is more or less fixed; at least YouTube videos work (I don't really know why) at https://mirror.download.kiwix.org/zim/.hidden/dev/womenshistory.si.edu_en_all_2024-08.zim

Multiple issues are still visible when you browse the ZIM. For example, blog pages are not displaying, but this looks like mostly a JS issue: the original website tries to make a POST request when we click the button, which is not easily fixable.
