Add YouTube caching util #278

WenyuZhang1992 · 2020-07-12T17:23:37Z

No description provided.

ivanistheone · 2020-07-13T01:05:36Z

ricecooker/utils/youtube_cache.py

+        Get YouTube playlist info by either requesting URL or extracting local cache
+        :param use_cache: Define if allowed to get playlist info from local JSON cache, default to True
+        :param youtube_ignore_error: Do not stop on download errors.
+                                     Please enable this option when videos of playlist is private or deleted thus extraction won't be blocked on those videos


Setting youtube_ignore_error=True is known to cause problems with playlists when using the web proxy. When a network error such as HTTP 429 occurs (the cheffing server's IP has been blocked by YouTube), we need the error to be raised so that we can switch to a different proxy server, but if we pass ignoreerrors=True to youtube_dl these errors get absorbed.

On the other hand, setting ignoreerrors=False breaks the JSON post-processing logic in YouTubeResource, since it assumes all results are valid. We'll either need to add special handling there or, better, create a special class for handling playlists, as you have started to do.

I have another idea for a workaround: use the extract_flat option to get just the URLs of the playlist's videos, then manually download each video (skipping the broken ones and those with missing permissions). This will allow us to reuse all the caching logic for individual videos instead of having a giant JSON blob for the whole playlist. I will provide more details on this issue: https://github.com/learningequality/pressurecooker/issues/32 .
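A rough sketch of that per-video loop, to make the error handling concrete. It does not call youtube_dl: `fetch_video`, `RateLimitError`, and `fetch_playlist_videos` are hypothetical names, and the entries are assumed to be the flat `{"id": ...}` dicts that a flat playlist extraction would yield. The point is only that a rate-limit error propagates while per-video failures are recorded and skipped.

```python
class RateLimitError(Exception):
    """Hypothetical error standing in for an HTTP 429 from YouTube."""


def fetch_playlist_videos(entries, fetch_video):
    """Fetch each flat playlist entry on its own.

    entries: list of dicts with at least an "id" key (assumed shape of a
             flat playlist extraction result).
    fetch_video: callable taking a video id and returning its info dict.
    Returns (fetched_infos, skipped), where skipped is a list of
    (video_id, reason) tuples.
    """
    fetched, skipped = [], []
    for entry in entries:
        video_id = entry["id"]
        try:
            fetched.append(fetch_video(video_id))
        except RateLimitError:
            # Must propagate so the caller can switch to another proxy.
            raise
        except Exception as exc:
            # Private, deleted, or otherwise broken video: record and move on.
            skipped.append((video_id, str(exc)))
    return fetched, skipped
```

With this shape, each successfully fetched video can go through the existing single-video cache, and the skipped list can be logged for the chef author.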

Hi Ivan, thanks for your comments. That's a very interesting point about the ignore_error option that I wasn't aware of. The reason I enabled this option is that, while extracting YouTube info, some of the deleted, private, and even normal videos in a playlist block progress on the whole playlist. So I chose to use this option and also wrote a helper function that extracts YouTube info for a single video and inserts it into the playlist cache file (for some reason, single-video extraction is much more stable than playlist extraction). It seems more research and trials are needed for those options, and I will update this part of the code.

ivanistheone · 2020-07-13T01:18:45Z

Hi Wenyu, I'm very excited by this PR, as it is a much-needed tool for simplifying YouTube downloads (so we don't have to replicate the same code in each chef). I haven't tried the code yet, but I'm looking forward to testing it out next week. Could you provide me a link to some sample code showing how to use these classes?

Heads up: I'll likely have "change requests" about this PR:

/1/ Add the functionality to pressurecooker instead of ricecooker. The idea is that tools in pressurecooker can later also be reused in Kolibri Studio to allow import-from-youtube, so this is why we have been putting all the youtube utils in pressurecooker (code base shared by both ricecooker and studio).

/2/ Rename the classes to YouTubeVideoResource and YouTubePlaylistResource (this would be an opportunity to revisit the functionality in YouTubeResource and give it a simpler interface based on past experience from the various chefs that used it).

/3/ Perhaps add a new resource YouTubeVideoChannel (a container of playlists) see example. The YouTubeVideoChannel could in turn call YouTubePlaylistResource to reuse all the playlist caching logic.

/4/ Another thing that I have been looking into related to caching are these two options:

    --write-info-json                Write video metadata to a .info.json file
    --load-info-json FILE            JSON file containing the video information (created with the "--write-info-json" option)

which would allow us to cache the FULL info JSON results as returned by the YouTube API, which will be easier for chef classes that use the YoutubeDL data (rather than the simplified JSON that contains only the ricecooker fields).

The less "custom stuff" we do, the better. Because we can't know in advance what parts of the info each chef will need, it's best to use the native functionality for saving and loading the JSON as a cache. It will require a little more disk storage, but it's worth it.
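To illustrate what caching the full info JSON could look like from Python, here is a minimal sketch. It is not the PR's implementation and does not invoke youtube_dl itself: `get_info_cached` and the `extract_info` callable are hypothetical, and the `<video_id>.info.json` filename is an assumption that mirrors the naming of the `--write-info-json` output.

```python
import json
import os


def get_info_cached(video_id, cache_dir, extract_info):
    """Return the full info dict for video_id, using a JSON file cache.

    extract_info is any callable that takes a video id and returns the
    complete (JSON-serializable) info dict; it is only called on a miss.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, video_id + ".info.json")

    # Cache hit: load the stored dict and skip the network entirely.
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)

    # Cache miss: extract, store the unmodified dict, then return it.
    info = extract_info(video_id)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(info, f)
    return info
```

Because the stored dict is not trimmed down to the ricecooker fields, a chef that later needs an extra key can read it from the cache without extracting again.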

/5/ Last but not least, it would be nice to add another cache for which-language-subtitles-are-available, which is a very common use case; see https://github.com/learningequality/sushi-chef-khan-academy/blob/master/network.py#L89-L117
We can add that as a helper method on YouTubeVideoResource: the ability to get and cache the subtitle info independently of downloading the video.
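As a sketch of what such a helper might return, assuming the info dict follows the youtube_dl layout in which a `subtitles` key maps language codes to lists of available formats (the function name `subtitle_languages` is hypothetical):

```python
def subtitle_languages(info):
    """Return the sorted language codes that have subtitles in an info dict.

    Assumes info["subtitles"] maps language codes to lists of formats;
    a missing or empty key yields an empty list.
    """
    return sorted((info.get("subtitles") or {}).keys())
```

The resulting list is small and JSON-serializable, so it could be cached on its own without downloading the video file.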

WenyuZhang1992 · 2020-07-13T03:11:24Z

Hi @ivanistheone, I actually have a sushi chef using this functionality here: https://github.com/learningequality/sushi-chef-refugee-response-crisis-advice/blob/master/utils.py#L141-L306 and this PR pretty much organizes that code snippet and moves it to ricecooker. Your comment offers great suggestions and lots of insights, which I will use as the starting point for redesigning this functionality.

kollivier · 2020-07-21T21:52:22Z

ricecooker/utils/youtube_cache.py

+        else:
+            self.cache_dir = cache_dir
+        if not os.path.isdir(self.cache_dir):
+            os.mkdir(self.cache_dir)


I would recommend using os.makedirs(self.cache_dir, exist_ok=True) instead. This way, you don't have to do the isdir check, and it will also create any parent directories of self.cache_dir that are missing as well.
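For illustration, a small self-contained example of the difference (the nested path is a made-up one under a temporary directory):

```python
import os
import tempfile

base = tempfile.mkdtemp()
# Hypothetical nested cache location; neither level exists yet.
cache_dir = os.path.join(base, "chefdata", "youtubecache")

# os.mkdir(cache_dir) would raise FileNotFoundError here because the
# parent "chefdata" is missing. makedirs creates both levels.
os.makedirs(cache_dir, exist_ok=True)

# Calling it again is a no-op instead of raising FileExistsError,
# which is why the explicit isdir check is no longer needed.
os.makedirs(cache_dir, exist_ok=True)
```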

kollivier · 2020-07-21T22:44:20Z

ricecooker/utils/youtube_cache.py

+
+class YouTubeVideoCache(object):
+
+    def __init__(self, video_id, alias='', cache_dir=''):


What is the use-case for specifying an alias for the cache name?

There might be cases where users want a self-defined cache filename instead of the default YouTube ID, which is not very descriptive. For example, in the Refugee Response project, I use the language code as the cache filename.
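A minimal sketch of that fallback, not the PR's actual code; the function name `cache_file_path` and the `.json` extension are assumptions made for the example:

```python
import os


def cache_file_path(video_id, alias="", cache_dir=""):
    """Build the cache file path, preferring the alias when one is given."""
    name = alias or video_id  # an empty alias falls back to the YouTube ID
    return os.path.join(cache_dir, name + ".json")
```

For example, `cache_file_path("abc123", alias="ar", cache_dir="cache")` names the file after the language code, while omitting the alias names it after the video ID.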

kollivier · 2020-07-21T22:53:17Z

Thanks @WenyuZhang1992! @ivanistheone and I discussed, and it would be okay to keep this code in ricecooker, since the caching functionality is specific to ricecooker.

Overall I think the code looks good! It appears the two classes share a majority of code, with the main difference being the playlist's handling of children. Could we consolidate the two classes into one class with get_video_info and get_playlist_info methods? Maybe call it YouTubeUtils? (and also rename the file to utils/youtube.py to match) This way caching is just a feature of this class, and we can continue to add more ricecooker-specific helper methods to it later on. (e.g. maybe a get_video_node or even get_video_topic helper method)

kollivier

LGTM!

WenyuZhang1992 added 2 commits July 12, 2020 10:22

Add YouTube caching util

56f1955

Format source code

0b81562

ivanistheone reviewed Jul 13, 2020


WenyuZhang1992 added 5 commits July 18, 2020 16:19

Update YouTube cache features

95e156b

Add tests

6778e59

Fix logic errors

35d0925

Add test for YouTube playlist

b6663ab

Add subtitles support for YouTube cache

ea403bc

kollivier reviewed Jul 21, 2020


WenyuZhang1992 added 3 commits July 21, 2020 21:29

Concrete YouTube cache classes to YouTubeUtils

62bed7c

Restructure the code

c7aca7d

Add option use_proxy

3c3f1b5

kollivier self-requested a review August 4, 2020 22:01

kollivier approved these changes Aug 4, 2020


kollivier merged commit 961b532 into learningequality:master Aug 4, 2020
