TED is not pushing 6 big topics anymore #150

Closed
benoit74 opened this issue Dec 12, 2023 · 10 comments

@benoit74
Collaborator

We previously decided to create one ZIM for each of the six topics pushed by TED on its website (Business, Design, Entertainment, ...).

These six topics are not pushed forward anymore (at least not anymore on the front page), and there are now many many topics.

Given that the scraper functionality to create ZIMs by topic is broken (see #149), we wonder whether it makes sense to continue with this strategy or to adopt a new one.

Some remarks:

@Popolechien @RavanJAltaie we need your help on this

@Popolechien

> These six topics are not pushed forward anymore (at least not anymore on the front page), and there are now many many topics.

@benoit74 Do you mean that the topics themselves have been deprecated and replaced by others, or that new ones have been added and, out of these, 10 are featured on the front page (but there could be 16 or 20 topics in all)?

@benoit74
Collaborator Author

Topics are not presented at all anymore on the front page (they were, if I understood @rgaudin's remark correctly).
And they are not emphasized much on the talks page mentioned.

Regarding which topics are available, you can check for yourself: they are displayed on the talks page.

"Legacy" topics were Business, Design, Entertainment, Global Issues, Science, Technology.

"New main" topics are AI, Business, Communication, Education, Health, Language, Leadership, Mental Health, Motivation, Personal Growth, Psychology, Sleep, Sports, TED-ed

So it looks like "legacy" topics are not that important anymore. But they are all still available (via the "See all" button).

In all there are hundreds of topics (again, "See all", but I did not count them ^^).

@Popolechien

We decided that at the end of the day we would like to capture these new topics. However, since the scraper is out of commission for the time being there is no point in creating the corresponding recipes. We can keep the existing files for the time being, they're still very watchable, but the TED scraper ultimately needs a fix (ideally one that also captures new topics when/if they are created).

A single mega-zim without the --topic parameter (and therefore taking all 6'000+ videos) would probably be unmanageable / undownloadable / unwatchable.

@benoit74
Collaborator Author

Do you mean that you will capture the 100+ topics in 100+ ZIMs?

For the scraper, there is no difference between old and new topics; they are just topics. So once fixed, it will be possible to scrape any topic. And I think that, just like we decided for YouTube very recently, the TED scraper will not create one ZIM per topic automatically, because otherwise it is a mess to set ZIM metadata (title, description, long description, ...).

@Popolechien

> Do you mean that you will capture the 100+ topics in 100+ ZIMs?

Likely yes, until we find a better way.

@benoit74
Collaborator Author

Same remark as openzim/zimfarm#878 (comment): how do you plan to create 100+ recipes manually? And maintain them manually in the long term? And be informed of new topics, which will obviously appear at some point?

In addition to the burden on the content team of creating and maintaining 100+ recipes, I'm also not convinced because the impact on storage is not negligible. As already mentioned, most videos are present in many topics, so a single video will be stored in multiple ZIMs, which increases storage space. This is a concern for us (although we could say that we don't mind and will pay for it), but also, and probably even more importantly, a concern for ZIM users, who won't have a convenient way to download a collection of TED ZIMs (or even the whole TED collection) without wasting storage space on their devices. I don't have a solution to this concern yet, but I feel that answering "let's create 100+ ZIMs" is not a realistic solution.
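
To put a rough number on the duplication concern, here is a minimal sketch that compares "one ZIM per topic" against storing each video once. The topic names, talk identifiers and sizes are made-up sample data, and `duplication_report` is a hypothetical helper, not part of the scraper; real figures would have to come from an actual listing of talks per topic.

```python
def duplication_report(topic_videos, video_size_gb):
    """Compare storage for one ZIM per topic against storing each video once.

    topic_videos: dict mapping a topic name to the list of video ids it contains.
    video_size_gb: dict mapping a video id to its size in GB.
    """
    # One ZIM per topic: a video counts once for every topic it appears in.
    per_topic_total = sum(
        video_size_gb[video]
        for videos in topic_videos.values()
        for video in set(videos)
    )
    # Deduplicated: every video counts exactly once.
    unique_videos = set().union(*topic_videos.values())
    unique_total = sum(video_size_gb[video] for video in unique_videos)
    return {
        "per_topic_gb": per_topic_total,
        "unique_gb": unique_total,
        "wasted_gb": per_topic_total - unique_total,
    }


# Made-up sample data, for illustration only.
topic_videos = {
    "business": ["talk-a", "talk-b"],
    "leadership": ["talk-b", "talk-c"],
    "technology": ["talk-a", "talk-c", "talk-d"],
}
video_size_gb = {"talk-a": 0.5, "talk-b": 0.25, "talk-c": 0.25, "talk-d": 1.0}

report = duplication_report(topic_videos, video_size_gb)
print(report)
```

The `wasted_gb` value is what a user downloading every topic ZIM would pay on top of the deduplicated collection.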

@Popolechien

Well, ideally, once the recipes are created, maintenance should be fairly light, if needed at all, except once in a blue moon when everything breaks at once and we are forced to have these discussions. Being informed of new topics is indeed a question for which we have no answer, which is why I wrote that there should be an actual project to manage all these issues and more.

In the meantime, we are in the process of providing educational content to users (TED is quite popular) and have no way to know whether their interests overlap across topics (i.e. would someone interested in business also want to hear about capitalism? And what if they would not?). It is for them to decide, and for us to provide them with the best info possible to know what's in the tin. The problem here is that TED itself does not provide any specific description for its topical playlists. The question therefore is: do we want to do it for them? If not, can we script the creation of recipes so as to mimic what we had until now (since the need for human input when creating recipes would be reduced to a minimum)?
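
As a rough illustration of what scripting recipe creation could look like, here is a minimal sketch that turns a list of topic names into recipe-like dictionaries with templated metadata, and flags topics that have no recipe yet. Everything here is an assumption: the field names (`recipe_name`, `topic`, `title`, `description`), the naming scheme and the helper functions are hypothetical and do not reflect the actual Zimfarm recipe format or the scraper's options.

```python
import re


def slugify(topic):
    """Lowercase a topic name and replace runs of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")


def build_recipes(topics):
    """Build one recipe-like dict per topic, with templated metadata.

    The field names are hypothetical placeholders, not a real recipe schema.
    """
    recipes = []
    for topic in sorted(set(topics), key=str.lower):
        recipes.append(
            {
                "recipe_name": f"ted_{slugify(topic)}",
                "topic": topic,
                "title": f"TED: {topic}",
                "description": f"TED talks on the topic {topic}",
            }
        )
    return recipes


def new_topics(known, current):
    """Return topics in `current` that have no counterpart in `known`."""
    known_slugs = {slugify(topic) for topic in known}
    return sorted(t for t in set(current) if slugify(t) not in known_slugs)


# Topic names taken from the "new main" list earlier in this thread.
recipes = build_recipes(["Mental Health", "AI", "TED-ed"])
for recipe in recipes:
    print(recipe["recipe_name"], "|", recipe["title"])

# Topics that appeared since the last run and still need a recipe.
print(new_topics(["AI", "Business"], ["AI", "Business", "Sleep"]))
```

Running `new_topics` periodically against the current topic list would be one way to learn about newly added topics, assuming that list can be obtained reliably.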

Storage on our side does not seem to be an issue: the largest TED ZIM we have is Technology, with ca. 1,200 videos and 50 GB. We currently have 606 individual TED ZIM files (the whole directory is probably in the 700 GB range), so going down to 100 or so may show some benefit, as we will reduce the number of duplicates.

@benoit74
Collaborator Author

OK, this makes sense.
Did I get you right that once all topic ZIMs have been created, we will delete the 606 individual TED ZIM files?

@Popolechien

This is correct.

@benoit74
Collaborator Author

OK, so it is quite clear now how to move this forward: we first need to fix the scraper.
