Skip to content

Conversation

@WenyuZhang1992
Copy link
Contributor

No description provided.

Get YouTube playlist info by either requesting URL or extracting local cache
:param use_cache: Define if allowed to get playlist info from local JSON cache, default to True
:param youtube_ignore_error: Do not stop on download errors.
Please enable this option when videos of playlist is private or deleted thus extraction won't be blocked on those videos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting youtube_ignore_error=True is known to cause problems with playlists when using the web proxy. We need to raise errors in cases where a network error occurs 429 (cheffing server IP blocked by youtube), so that we change to a different proxy server, but if we give ignoreerrors=True to youtube_dl these errors get absorbed.

On the other hand setting ignoreerrors=False breaks the json post-processing logic in YouTubeResource, since it will assume all results are valid so we'll either need to add special handling there (or better create a special class for handling playlist like you have started).

I have another idea for workaround to use extract_flat option to get just the URLs of playlists resource, then manually download each video (skipping the ones with missing permission and broken ones). This will allow us to reuse all the caching logic for individual videos instead of having a giant json blob for the whole playlist. I will provide more details on this issue https://github.com/learningequality/pressurecooker/issues/32 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ivan, thanks for your comments. That's a very interesting point on option ignore_error that I'm not aware of. The reason I enable this option is that during extraction YouTube info, some of the deleted, private and even normal videos from the playlist will block progress of the whole playlist. So I choose to use this option and meanwhile design a helper function to extract YouTube info for single video and insert that to playlist cache file(for somehow, single YT video extraction is much stable than a YT playlist). Seems more researches and trails are necessary for those options and I will update this code part.

@ivanistheone
Copy link
Contributor

Hi Wengyu, I'm very excited by this PR as it is a much needed tool for simplifying YouTube downloads (so we don't have to replicate the same code in each chef). I haven't tried the code yet, but looking forward to testing out next week. Could you provide me a link to some sample code how to use these classes?


Heads up I'll likely have "change requests" about this PR:

/1/ Add the functionality to pressurecooker instead of ricecooker. The idea is that tools in pressurecooker can later also be reused in Kolibri Studio to allow import-from-youtube, so this is why we have been putting all the youtube utils in pressurecooker (code base shared by both ricecooker and studio).

/2/ Rename the classes to YouTubeVideoResource and YouTubePlaylistResource (this would be an opportuinty to revisit the functionality in YouTubeResource and give a simpler interface based on the past experience form various chefs that used).

/3/ Perhaps add a new resource YouTubeVideoChannel (a container of playlists) see example. The YouTubeVideoChannel could in turn call YouTubePlaylistResource to reuse all the playlist caching logic.

/4/ Another thing that I have been looking into related to caching are these two options:

    --write-info-json                Write video metadata to a .info.json file
    --load-info-json FILE            JSON file containing the video information (created with the "--write-info-json" option)

which would allow us to cached the FULL info json results as returned by the youtube API, which will be easier for chef classes that use the YoutubeDL data (rather than the simplified json that contains only the ricecooker fields).

The less "custom stuff" we do, the better, because we can't know in advance what parts of info each chef will need, it's best to use the native functionality for saving the json and loading the json as a cache. It will require a little more disk storage, but better.

/6/ Last but not least, it would be nice to add another cache for which-language-subtitles-are-available which is a very common use case, see https://github.com/learningequality/sushi-chef-khan-academy/blob/master/network.py#L89-L117
We can add that as helper method on YouTubeVideoResource -- the ability to get and cache the subtitle info independently of downloading the video download.

@WenyuZhang1992
Copy link
Contributor Author

Hi @ivanistheone , actually I have a sushichef using this functionality here: https://github.com/learningequality/sushi-chef-refugee-response-crisis-advice/blob/master/utils.py#L141-L306 and this PR pretty much organizes and moves that code snippet to ricecooker. Your comment offers great suggestions and lots of insights which I will start to redesign this functionality from.

else:
self.cache_dir = cache_dir
if not os.path.isdir(self.cache_dir):
os.mkdir(self.cache_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend using os.makedirs(self.cache_dir, exist_ok=True) instead. This way, you don't have to do the isdir check, and it will also create any parent directories of self.cache_dir that are missing as well.


class YouTubeVideoCache(object):

def __init__(self, video_id, alias='', cache_dir=''):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use-case for specifying an alias for the cache name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some potential usage that users might wanna use self-defined cache filename instead of the YouTube ID by default which actually is not straightforward enough. For example, in the Refugee Response project, I use the language code as the cache filename.

@kollivier
Copy link
Contributor

Thanks @WenyuZhang1992! @ivanistheone and I discussed, and it would be okay to keep this code in ricecooker, since the caching functionality is specific to ricecooker.

Overall I think the code looks good! It appears the two classes share a majority of code, with the main difference being the playlist's handling of children. Could we consolidate the two classes into one class with get_video_info and get_playlist_info methods? Maybe call it YouTubeUtils? (and also rename the file to utils/youtube.py to match) This way caching is just a feature of this class, and we can continue to add more ricecooker-specific helper methods to it later on. (e.g. maybe a get_video_node or even get_video_topic helper method)

@kollivier kollivier self-requested a review August 4, 2020 22:01
Copy link
Contributor

@kollivier kollivier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kollivier kollivier merged commit 961b532 into learningequality:master Aug 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants