Remove ability for youtube scraper to create multiple ZIM (one per playlist) #878

Closed
kelson42 opened this issue Dec 3, 2023 · 25 comments · Fixed by #974

@kelson42
Contributor

kelson42 commented Dec 3, 2023

See openzim/youtube#147.

Recipes using this configuration should be listed and a migration scenario should be decided first.

@benoit74
Collaborator

@Popolechien @RavanJAltaie

What do you want to do with these recipes which have the playlist mode enabled?

aimhi_playlists
dse_ladakh_lbj_playlists
keylearning_en
khan-videos_ar_playlists
khan-videos_bn_playlists
khan-videos_en_playlists
khan-videos_es_playlists
khan-videos_fr_playlists
khan-videos_tr_playlists
madrasa_ar_playlists
project-fuel
ruangguru_id_playlists
scienceinthebath_playlist
slam-out-loud_hi
tutorial-wikipedia
ubongo_sw
voa_learning_english_all_playlists
zenius_id_playlists

@Popolechien
Contributor

@benoit74 Let us go over them later today and get back to you.

@Popolechien
Contributor

Popolechien commented Dec 12, 2023

Please add Canadian Prepper to the list of ZIM files that use multiple playlists and need updating. Basically, we'll recreate them all with one playlist per recipe / ZIM file (except Khan, zenius and ruangguru).

@benoit74
Collaborator

Canadian prepper has no problem, it is creating one ZIM from a bunch of playlists. What we want to deactivate/remove is when you want to automatically create one ZIM per playlist in the channel / user. Except if it is indeed badly configured and what you wanted is multiple ZIMs, but that's another story.
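
To make the distinction concrete, here is a minimal sketch in Python. It assumes each recipe's scraper configuration is available as a dictionary with a comma-separated `id` and an optional boolean `indiv-playlists` flag; the recipe names, playlist IDs and helper functions below are hypothetical and only serve as an illustration, not as the Zimfarm's actual implementation.

```python
def uses_playlists_mode(config: dict) -> bool:
    """Return True when the recipe asks for one ZIM per playlist."""
    return bool(config.get("indiv-playlists", False))


def playlist_count(config: dict) -> int:
    """Count the non-empty playlist IDs in the comma-separated `id` value."""
    ids = (part.strip() for part in config.get("id", "").split(","))
    return sum(1 for playlist_id in ids if playlist_id)


# Hypothetical recipes: the first builds ONE ZIM from several playlists (fine),
# the second builds one ZIM PER playlist (the mode to be removed).
recipes = {
    "one_zim_from_many_playlists": {"type": "playlist", "id": "PL_a,PL_b,PL_c"},
    "one_zim_per_playlist": {
        "type": "playlist",
        "id": "PL_a,PL_b",
        "indiv-playlists": True,
    },
}

to_migrate = sorted(
    name for name, config in recipes.items() if uses_playlists_mode(config)
)
print(to_migrate)
```

Under these assumptions, only the recipe with the flag set needs migrating; the number of playlists in `id` is irrelevant on its own.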

How should we move forward on this? Should we wait for you to recreate all recipes, or can we simply delete them and you will recreate them on the fly (deleting the recipe will not remove the ZIMs anyway)? Do you need a configuration export so that it will be easier to recreate?

Do you plan to create all recipes manually?

What do you mean by "except Khan, zenius and ruangguru"? Do you plan to simply delete these recipes and not create any more ZIMs for them?

@Popolechien
Contributor

What we want to deactivate/remove is when you want to automatically create one ZIM per playlist in the channel / user.

Ok this I had misunderstood. If there is a way to easily export/duplicate the recipes then by all means let's do it, but otherwise we'll need to recreate them manually (and then only delete the original ones).

Khan, zenius and ruangguru will be deleted entirely (recipes AND zim files).

@benoit74
Collaborator

OK, thank you.

Export is easy, it will just be a "raw" copy of the configuration, just so that you have a reference of the old configuration. E.g. for aimhi_playlists (I redacted the secrets), I will export the configuration below and then delete the recipe:

{
	"api-key": "**********",
	"concurrency": 1,
	"debug": true,
	"format": "webm",
	"id": "PLr5n3ojAJWjSVnG_EK1xF3rW1Lo0N2qwA,PLr5n3ojAJWjQEiRIuHlRoN7rBDKG6GbvH,PLr5n3ojAJWjRGQ1DnnIqDrIXKuuNydGid,PLr5n3ojAJWjTkqDW49ew1u7vsbVzM5Uub,PLr5n3ojAJWjSVh9mgusLb6npGNnFif_Sw,PLr5n3ojAJWjRSuu4s5Vu1CEN0rkgXZ-VA,PLr5n3ojAJWjRDmRmIVAsD4MSEt7wMSOGr,PLr5n3ojAJWjSNVp6jrlwXPz5MyFArv6oO,PLr5n3ojAJWjRiuPrUAAveWrrqnNPPNzYK",
	"indiv-playlists": true,
	"language": "eng",
	"low-quality": true,
	"main-color": "#FFFFFF",
	"optimization-cache": "https://s3.us-west-1.wasabisys.com/?keyId=*****&secretAccessKey=*****&bucketName=org-kiwix-youtube",
	"output": "/output",
	"playlists-description": "The nature-first, curiosity-powered online school for ages 8-18",
	"playlists-name": "aimhi_en_-{title}",
	"playlists-title": "AimHi",
	"playlists-zim-file": "aimhi_en_{slug}_{period}",
	"tags": "aimhi",
	"tmp-dir": "/output",
	"type": "playlist"
}

Is this useful?
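
For readers comfortable with a little scripting, such an export could serve as a starting point. The Python sketch below splits a playlists-mode export into one configuration per playlist by dropping `indiv-playlists` and the `playlists-*` keys. The target keys `name` and `zim-file`, the `split_export` helper, the shortened playlist IDs and the caller-supplied slugs are all assumptions made for illustration; check the scraper's actual options before relying on them.

```python
# Keys that only make sense in "one ZIM per playlist" mode.
PLAYLISTS_MODE_KEYS = (
    "indiv-playlists",
    "playlists-name",
    "playlists-zim-file",
    "playlists-title",
    "playlists-description",
)


def split_export(export: dict, slugs: dict, name_template: str) -> list:
    """Build one single-ZIM configuration per playlist ID in the export.

    `slugs` maps each playlist ID to the slug chosen for its ZIM (hypothetical
    input: the export itself does not contain these slugs).
    """
    configs = []
    for playlist_id in (part.strip() for part in export["id"].split(",")):
        if not playlist_id:
            continue
        slug = slugs[playlist_id]
        # Copy everything except the playlists-mode keys.
        config = {k: v for k, v in export.items() if k not in PLAYLISTS_MODE_KEYS}
        config["id"] = playlist_id
        # Assumed target keys; "{period}" is left untouched in the filename.
        config["name"] = name_template.replace("{slug}", slug)
        config["zim-file"] = export["playlists-zim-file"].replace("{slug}", slug)
        configs.append(config)
    return configs


export = {
    "id": "PL_first,PL_second",  # placeholder IDs, not real playlists
    "indiv-playlists": True,
    "language": "eng",
    "playlists-name": "aimhi_en_-{title}",
    "playlists-zim-file": "aimhi_en_{slug}_{period}",
    "type": "playlist",
}
slugs = {"PL_first": "topic-a", "PL_second": "topic-b"}
configs = split_export(export, slugs, "aimhi_en_{slug}")
for config in configs:
    print(config["id"], config["name"], config["zim-file"])
```

The output of such a script would still have to be entered into the Zimfarm by hand or through whatever tooling is available, so it only helps as a reference, not as a batch-creation mechanism.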

Duplicate is something you already have with the "Clone" button. But you still have to input everything else.

From my PoV, this last remark emphasizes that:

  • you do not have a convenient way to create/edit recipes in batch
  • you do not have a convenient way to get the list of playlists in a given channel / user
  • you do not have a convenient way to be informed when a new playlist is created
  • you do not have a convenient way to be informed when a playlist is deleted (the recipe will fail, but you have to check it manually)
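
As an illustration of what a "quick and dirty win" on the last two points might look like, the Python sketch below compares two snapshots of a channel's playlist IDs and reports which playlists appeared or disappeared. How the snapshots are obtained (for example from the YouTube API) is out of scope here; the `diff_playlists` helper and the playlist IDs are hypothetical.

```python
def diff_playlists(previous: set, current: set) -> dict:
    """Report playlist IDs that were created or deleted between two snapshots."""
    return {
        "created": sorted(current - previous),
        "deleted": sorted(previous - current),
    }


# Hypothetical snapshots taken at two different dates.
previous = {"PL_a", "PL_b", "PL_c"}
current = {"PL_b", "PL_c", "PL_d"}

changes = diff_playlists(previous, current)
print(changes)
```

Something like this would still need a scheduler and a notification channel around it, which is part of why it may not qualify as a real solution.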

I don't know if we should live with it, try some quick and dirty wins on some of these topics, or implement a real solution.

@Popolechien
Contributor

It is probably for @RavanJAltaie to decide how she wants to proceed, but I'm not sure the export is really useful, as neither she nor I have the skills to create the new recipes via script. Our last discussion was to clone existing recipes (in which case I'd delete them after the deed is done).

As for next steps, the quick and dirty tends to be somewhat permanent in this house and not exactly convenient for the non-dev end user either: I suggest we park it until this becomes a real project.

@benoit74
Collaborator

OK, so next steps before I can start to work on this issue are:

  • Ravan creates all needed recipes without the "playlists mode" (again, putting multiple playlists in one single ZIM like Canadian Prepper is OK)
  • Ravan deletes all recipes using the "playlists mode"

Correct? Note that I'm not speaking about the deletion of unwanted ZIMs, since there is no dependency AFAIK and we can do it at any time, at your own convenience.

I'm waiting for your GO to perform the last step, which consists of removing the ability to use the "playlists mode" in Zimfarm.

@Popolechien
Contributor

I realize that @RavanJAltaie was not on the thread and missed that part. I've assigned her now so she can confirm to you when all new recipes have been created.

@RavanJAltaie

Now I'm confused: recipes with multiple playlists are OK? The only thing to do is deactivate the playlist mode? @benoit74

@benoit74
Collaborator

@RavanJAltaie Yes, you are right. Recipes with multiple playlists in one ZIM are OK.

@benoit74
Collaborator

And yes, we just want to get rid of the playlist mode.

@RavanJAltaie

All fixed successfully!

@benoit74
Collaborator

Great, thank you!

I am reopening the issue because I still have my part of the job to do (remove the ability to create youtube recipes which create multiple ZIMs at once).

@benoit74 benoit74 reopened this Jan 31, 2024
@benoit74
Collaborator

@RavanJAltaie I'm sorry but madrasa_ar_playlists is still using the Playlists mode, please fix it before I can proceed.

@rgaudin
Member

rgaudin commented Jan 31, 2024

It's not clear from this ticket what happened exactly and what will happen:

  • “Fixed successfully” is mentioned but I understand it's a per-recipe question: some being changed to disable playlists mode, new recipes to be created, some to be deleted.
  • I can't find the recipes for aimhi_playlists in the zimfarm.
  • The ZIM is still in the library

@RavanJAltaie

All fixed.

@benoit74 benoit74 reopened this Mar 5, 2024
@benoit74
Collaborator

benoit74 commented Mar 5, 2024

Again, I still have my part of the job to do.

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

@RavanJAltaie could you please detail, recipe by recipe for the list in #878 (comment), what has been done?

I had a quick look and it seems that in many cases, you simply removed the playlist mode and created one big ZIM instead of many small ones, is this correct? The only exception is madrasa?

When you used this "create only one ZIM instead of many small ones" approach, it looks like you kept the old small ZIMs in the library. Is this intentional? Is the content evergreen, so that we do not mind keeping them in the library and no longer updating them?

I'm not convinced by this strategy: usually there were only 5 or 6 playlists, and it did not look like the number of playlists was frequently updated. Small ZIMs are usually more practical for our users. For https://farm.openzim.org/recipes/voa_learning_en_all for instance, we moved from ZIMs ranging from 59.48 MB to 12.72 GB to one enormous (from my perspective at least) 24.93 GB ZIM. But maybe users are always downloading all ZIMs, so the extra work to create individual ZIMs is not worth it. It is just that this decision is very opaque and has not been explained, so it feels a bit weird.

For madrasa, I'm not convinced about the ZIM name / filename. For instance, you chose madrasa_astronomy_ar_all while I consider it should be madrasa_ar_astronomy (project is madrasa, selection is astronomy, just like we have wikipedia_en_football, ...).

And for madrasa is there any reason to keep the two disabled recipes? Especially madrasa_ar_playlists which still uses the playlist mode?

@RavanJAltaie

RavanJAltaie commented Mar 5, 2024

@benoit74

I had a quick look and it seems that in many cases, you simply removed the playlist mode and created one big ZIM instead of many small ones, is this correct? The only exception is madrasa?

Yes that's correct, this is the decision made by @Popolechien & me after discussing the #878 issue.

I'm not convinced by this strategy: usually there were only 5 or 6 playlists, and it did not look like the number of playlists was frequently updated. Small ZIMs are usually more practical for our users. For https://farm.openzim.org/recipes/voa_learning_en_all for instance, we moved from ZIMs ranging from 59.48 MB to 12.72 GB to one enormous (from my perspective at least) 24.93 GB ZIM. But maybe users are always downloading all ZIMs, so the extra work to create individual ZIMs is not worth it. It is just that this decision is very opaque and has not been explained, so it feels a bit weird.

That's the strategy we followed when creating the madrasa playlists. For the few corrected playlists, we decided to keep them in one file, but I can re-discuss this with @Popolechien today and change it if agreed upon. Personally, I don't think it is worth splitting the playlists.

For madrasa, I'm not convinced about the ZIM name / filename. For instance, you chose madrasa_astronomy_ar_all while I consider it should be madrasa_ar_astronomy (project is madrasa, selection is astronomy, just like we have wikipedia_en_football, ...).

I agree with you, I'll change the naming for all the files and apply this on new creations as well.

And for madrasa is there any reason to keep the two disabled recipes? Especially madrasa_ar_playlists which still uses the playlist mode?

No, no reason, I'll open an issue to delete them.

@rgaudin
Member

rgaudin commented Mar 5, 2024

Also, as the convention clearly expresses, using the project name instead of the domain name should be exceptional. I have the feeling this rule is frequently abused. @Popolechien @RavanJAltaie please clarify this.

@RavanJAltaie

Also, as the convention clearly expresses, using the project name instead of the domain name should be exceptional. I have the feeling this rule is frequently abused. @Popolechien @RavanJAltaie please clarify this.

In this case, the naming for madrasa should be: Youtube_ar_madrasa_astronomy?

@rgaudin
Member

rgaudin commented Mar 5, 2024

madrasa.org_ar_astronomy for this one, but for all the YouTube-only recipes, a convention needs to be decided, documented and followed.

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

In this case, the naming for madrasa should be: Youtube_ar_madrasa_astronomy?

We must reserve _ as separator in ZIM name and ZIM filename, i.e. project and selection must use only alphanums + . + -, I will update the convention to make it clearer (speak up if I forgot a needed character).
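
A minimal sketch of such a check in Python, assuming a ZIM name is made of exactly three `_`-separated parts (project, language, selection) and that each part may only contain alphanumerics, `.` and `-`. The `is_valid_zim_name` helper is hypothetical and the three-part assumption is a simplification for illustration, not the convention's official wording.

```python
import re

# One part of a ZIM name: alphanumerics, "." and "-" only, at least one character.
PART = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_zim_name(name: str) -> bool:
    """Check that `name` is three well-formed parts separated by "_"."""
    parts = name.split("_")
    return len(parts) == 3 and all(PART.fullmatch(part) for part in parts)


for candidate in (
    "madrasa.org_ar_astronomy",
    "madrasa_ar_astronomy",
    "wikipedia_en_football",
    "Youtube_ar_madrasa_astronomy",
):
    print(candidate, is_valid_zim_name(candidate))
```

Under these assumptions, `Youtube_ar_madrasa_astronomy` is rejected because the extra `_` splits it into four parts, which is exactly the ambiguity that reserving `_` as the separator is meant to prevent.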

That's the strategy we followed when creating the madrasa playlists. For the few corrected playlists, we decided to keep them in one file, but I can re-discuss this with @Popolechien today and change it if agreed upon. Personally, I don't think it is worth splitting the playlists.

No need to discuss it again if it has been agreed upon. It just would have been better to put these conclusions here beforehand, so that everyone involved is aware of them and we keep a track record. I'm pretty sure we will have a question about it in a few months.

What about older ZIMs (per playlist), do we keep them in the library? For madrasa, since you are changing the name, you will probably also have to delete older ZIMs.

@benoit74
Collaborator

benoit74 commented May 22, 2024

Let's close this issue regarding the Zimfarm's ability to create multiple ZIMs per recipe for the youtube scraper.

Everything else can be discussed separately if needed (and is at least partially already an ongoing effort).
