-
Notifications
You must be signed in to change notification settings - Fork 74
Add YouTube caching util #278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add YouTube caching util #278
Conversation
ricecooker/utils/youtube_cache.py
Outdated
| Get YouTube playlist info by either requesting URL or extracting local cache | ||
| :param use_cache: Define if allowed to get playlist info from local JSON cache, default to True | ||
| :param youtube_ignore_error: Do not stop on download errors. | ||
| Please enable this option when videos of playlist is private or deleted thus extraction won't be blocked on those videos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting youtube_ignore_error=True is known to cause problems with playlists when using the web proxy. We need to raise errors in cases where a network error occurs 429 (cheffing server IP blocked by youtube), so that we change to a different proxy server, but if we give ignoreerrors=True to youtube_dl these errors get absorbed.
On the other hand setting ignoreerrors=False breaks the json post-processing logic in YouTubeResource, since it will assume all results are valid so we'll either need to add special handling there (or better create a special class for handling playlist like you have started).
I have another idea for workaround to use extract_flat option to get just the URLs of playlists resource, then manually download each video (skipping the ones with missing permission and broken ones). This will allow us to reuse all the caching logic for individual videos instead of having a giant json blob for the whole playlist. I will provide more details on this issue https://github.com/learningequality/pressurecooker/issues/32 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ivan, thanks for your comments. That's a very interesting point on option ignore_error that I'm not aware of. The reason I enable this option is that during extraction YouTube info, some of the deleted, private and even normal videos from the playlist will block progress of the whole playlist. So I choose to use this option and meanwhile design a helper function to extract YouTube info for single video and insert that to playlist cache file(for somehow, single YT video extraction is much stable than a YT playlist). Seems more researches and trails are necessary for those options and I will update this code part.
|
Hi Wengyu, I'm very excited by this PR as it is a much needed tool for simplifying YouTube downloads (so we don't have to replicate the same code in each chef). I haven't tried the code yet, but looking forward to testing out next week. Could you provide me a link to some sample code how to use these classes? Heads up I'll likely have "change requests" about this PR: /1/ Add the functionality to pressurecooker instead of ricecooker. The idea is that tools in /2/ Rename the classes to /3/ Perhaps add a new resource /4/ Another thing that I have been looking into related to caching are these two options: which would allow us to cached the FULL info json results as returned by the youtube API, which will be easier for chef classes that use the YoutubeDL data (rather than the simplified json that contains only the ricecooker fields). The less "custom stuff" we do, the better, because we can't know in advance what parts of info each chef will need, it's best to use the native functionality for saving the json and loading the json as a cache. It will require a little more disk storage, but better. /6/ Last but not least, it would be nice to add another cache for which-language-subtitles-are-available which is a very common use case, see https://github.com/learningequality/sushi-chef-khan-academy/blob/master/network.py#L89-L117 |
|
Hi @ivanistheone , actually I have a |
ricecooker/utils/youtube_cache.py
Outdated
| else: | ||
| self.cache_dir = cache_dir | ||
| if not os.path.isdir(self.cache_dir): | ||
| os.mkdir(self.cache_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend using os.makedirs(self.cache_dir, exist_ok=True) instead. This way, you don't have to do the isdir check, and it will also create any parent directories of self.cache_dir that are missing as well.
ricecooker/utils/youtube_cache.py
Outdated
|
|
||
| class YouTubeVideoCache(object): | ||
|
|
||
| def __init__(self, video_id, alias='', cache_dir=''): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use-case for specifying an alias for the cache name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some potential usage that users might wanna use self-defined cache filename instead of the YouTube ID by default which actually is not straightforward enough. For example, in the Refugee Response project, I use the language code as the cache filename.
|
Thanks @WenyuZhang1992! @ivanistheone and I discussed, and it would be okay to keep this code in Overall I think the code looks good! It appears the two classes share a majority of code, with the main difference being the playlist's handling of children. Could we consolidate the two classes into one class with |
kollivier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
No description provided.