Set ZIM description properly in case of multiple playlist scraping #147

Closed
kelson42 opened this issue Jul 4, 2021 · 17 comments

Comments

@kelson42
Contributor

kelson42 commented Jul 4, 2021

Currently the description seems to be just a "-" character, as seen at http://library.kiwix.org/khan-academy-videos_en_geometric-optics-ap-physics-2-khan-academy_2021-04/M/Description

@rgaudin
Member

rgaudin commented Jul 4, 2021

this is a local host link...

@rgaudin
Member

rgaudin commented Jul 4, 2021

Do you mean the description is simply "-"?

@rgaudin
Member

rgaudin commented Jul 4, 2021

When building for a single playlist, the scraper uses this playlist's description as the ZIM description. When there are several playlists, it's not possible to build anything meaningful, so it renders as "-". Of course, it's up to the scraper user to provide a meaningful one.

If you want to replace that placeholder with another one, please suggest something.
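
A minimal sketch of the default behaviour described in this comment, assuming playlists are represented as simple objects with a `description` field. The names `Playlist`, `PLACEHOLDER` and `default_description` are hypothetical and are not taken from the scraper's code. Falling back to the placeholder when a single playlist has an empty description is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

PLACEHOLDER = "-"  # the placeholder discussed in this thread


@dataclass
class Playlist:
    # Hypothetical stand-in for whatever the scraper knows about a playlist.
    title: str
    description: str


def default_description(
    playlists: Sequence[Playlist], user_description: Optional[str] = None
) -> str:
    """Pick a ZIM description (illustrative only, not the scraper's code)."""
    # A description supplied by the scraper user always wins.
    if user_description:
        return user_description
    # A single playlist can lend its own description to the ZIM.
    if len(playlists) == 1 and playlists[0].description:
        return playlists[0].description
    # Several playlists may have nothing in common: use the placeholder.
    return PLACEHOLDER
```

With two unrelated playlists and no user-supplied description, this sketch returns the "-" placeholder, which is the situation this issue reports.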

@rgaudin rgaudin removed the bug label Jul 4, 2021
@kelson42
Contributor Author

kelson42 commented Jul 4, 2021

@rgaudin If there is no description, it should be empty or not even existing metadata. But not -.
@Popolechien I had the feeling after our discussion that it was possible to scrape something meaningful?

@rgaudin
Member

rgaudin commented Jul 4, 2021

@rgaudin If there is no description, it should be empty or not even existing metadata. But not -.

That contradicts the “we should always provide a Description” rule. I think the "-" nudges the creator into changing it, but I'm fine either way.

@Popolechien I had the feeling after our discussion that it was possible to scrape something meaningful?

How would that be possible? Playlists can have nothing in common. You can choose to use the first playlist's description, which is meaningful but misleading, or you can build another placeholder like “2 playlists from Youtube”, but that's not meaningful.

We are talking about default behavior here. The scraper should relieve users from having to input information that is already available, but it's not its purpose to do their job completely. Building a ZIM from several playlists is very handy, but providing a description for this ZIM should be up to the user.

@Popolechien

@Popolechien I had the feeling after our discussion, that it was possible to scrape something meaningful?

I don't remember the discussion, maybe that was on another topic? I checked and playlists on YT do not have specific descriptors, which is a bummer. Maybe something more generic like "A playlist from the XYZ Youtube channel" would be a little better, but that's as far as we could go.

@rgaudin
Member

rgaudin commented Jul 5, 2021

Maybe something more generic like "A playlist from the XYZ Youtube channel" would be a little better, but that's as far as we could go.

@Popolechien when we have a single playlist, we already have something: the channel's description I believe. This ticket is for ZIM of multiple playlists.

@Popolechien

This ticket is for ZIM of multiple playlists.

@rgaudin Yes, I understand. I agree that just "-" does not provide enough information at the moment if the playlist title is too generic or too specific. Channel descriptions are written for people who already know they're on Youtube (as opposed to browsing a random list of contents from an app's library). Knowing that the content comes from Youtube and is a playlist is already actionable information for the user IMHO.

@rgaudin
Member

rgaudin commented Jul 5, 2021

Totally agree. Please propose a way forward for all the use cases that are not good enough. Otherwise this ticket is not fixable.

@Popolechien

Then let's set the default ZIM description to "A playlist from the <Channel name> Youtube channel".
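
As a rough illustration of this proposal, the generic default could be built from the channel name alone. The function name `generic_description` is hypothetical, and rejecting an empty channel name is an assumption made for the sketch, not something decided in this thread.

```python
def generic_description(channel_name: str) -> str:
    """Build the generic default proposed above (illustrative sketch)."""
    name = channel_name.strip()
    if not name:
        # Assumption: without a channel name there is nothing useful to say.
        raise ValueError("channel name is required")
    return f"A playlist from the {name} Youtube channel"
```

For example, `generic_description("Khan Academy")` would give "A playlist from the Khan Academy Youtube channel".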

@stale

stale bot commented Sep 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42
Contributor Author

kelson42 commented Nov 26, 2023

We need to have a proper description in the catalogue and this is why we need the Metadata Description in the ZIM.

BTW, the specification does not explicitly say that empty-string is forbidden... Not sure this is a caveat or a feature... But we want to have a ZIM description.

I'm not in favour of doing something generic like "-" or "A playlist from the <Channel name> Youtube channel"... this brings almost no information. This is a workaround, not a solution.

Therefore, IMO, there are not a thousand possibilities:

  • We wait for the CMS to be able to handle the problem... but ultimately the CMS should not be there to compensate for scraper weaknesses
  • We implement a way to specify a Description for each Playlist in such a scenario
  • We forbid the scraper from automatically creating so many ZIM files
  • We keep the feature in the scraper but not in the Zimfarm

IMHO the last solution is the appropriate one: we should create one recipe per playlist and clearly specify the description. I prefer to have less content but better-quality content.

@benoit74
Collaborator

IMHO, I see significant advantages in the idea of removing the "Playlist mode" from the Zimfarm:

  • from recent interactions with @RavanJAltaie and @Popolechien, this mode is causing quite a lot of confusion:
    • confusion between "Playlist mode" and list of playlists in a single ZIM
    • confusion between which parameters have to be set
    • confusion about how to use placeholders
    • Note that I'm not saying that we can't continue our efforts to reach a good level of understanding of these settings, but clearly there is a price to pay even for someone who is not new to the team (so once the team grows with new teammates, that price will have to be paid again)
  • with "Playlist mode", it is not possible to spread the workload across workers and we get very long running recipes which are hard to manage (e.g. we are very reluctant to stop them once started should we need to)
  • it will force us to really tackle some disadvantages listed below which have only been circumvented by the "Playlists mode" which could be seen as a tweak
    • creating many ZIMs in one Zimfarm run should probably be limited to cases where a significant amount of data is shared and it would be a shame to retrieve it many times for the many ZIMs to create

The main disadvantages are:

  • it is a significant task to create 10s or 100s of recipes by hand without making an error (again, mass view / edit would probably help here)
  • it will become hard to maintain coherency between the 10s or 100s of recipes of the same Youtube channel
    • this is most probably just a good indicator that we need a feature to compare recipe settings and/or mass edit them in the Zimfarm, this is not the first time we encounter this need (e.g. API keys, docker image tag)
  • it is hard (impossible?) to detect new playlists and hence new recipes that will have to be created as the Youtube channel evolves
    • tooling around this is probably quite easy to develop
    • mostly true as well for playlists which have been removed (but here the recipe will fail so it is a bit easier)
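
The tooling mentioned in the last point could start from a plain set comparison. This is only a sketch under stated assumptions: the function name `diff_playlists` is hypothetical, and how the playlist IDs are obtained (from the Youtube channel on one side, from existing recipes on the other) is deliberately left out.

```python
from typing import Iterable, Set, Tuple


def diff_playlists(
    channel_playlist_ids: Iterable[str], recipe_playlist_ids: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Compare playlists on a channel with playlists that already have a recipe.

    Returns (missing_recipes, orphaned_recipes):
    - missing_recipes: playlists on the channel with no recipe yet
    - orphaned_recipes: recipes whose playlist is no longer on the channel
    """
    on_channel = set(channel_playlist_ids)
    in_recipes = set(recipe_playlist_ids)
    return on_channel - in_recipes, in_recipes - on_channel
```

Run periodically, the first set would list recipes to create and the second would list recipes that are expected to fail because their playlist was removed.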

@benoit74
Collaborator

And this issue has focused so far on the description, but the same is true for other metadata like name, filename, title, maybe even tags and icon (with varying degrees of customization needed, of course)

@rgaudin
Member

rgaudin commented Nov 27, 2023

I also think of current playlists-mode usage as lazy mode and believe we'd improve the ZIM quality by not using it in the Zimfarm. The load-distribution argument is also very valid.

That said, I'd just remove support for it in the farm and keep the feature in the scraper.

@kelson42
Contributor Author

kelson42 commented Dec 3, 2023

I have opened the ticket on Zimfarm. Is there anything else we could do here with this issue?

@Popolechien

Nope. I'm late to the party but I agree with the idea of removing the multiple playlists mode from the farm for the time being.

@kelson42 kelson42 closed this as not planned Dec 6, 2023
@kelson42 kelson42 unpinned this issue Aug 3, 2024