Buildout YouTubeResource functionality #333

ivanistheone · 2019-01-09T15:12:40Z

@kollivier Please check this extra functionality added by Alejandro around the YouTubeResource class and see if there aren't any parts we might want to incorporate into pressurecooker:
https://github.com/elaeon/sushi-chef-science-ahmed-al-hoot-ar/blob/master/sushichef.py#L162-L279

^ all seems very useful and reusable

ivanistheone · 2020-07-13T00:16:36Z

Various youtube automation efforts:

Old attempt at caching logic https://github.com/fle-internal/youtube_utils (outdated)
Updated caching logic https://github.com/benjaoming/sushi-chef-ubongokids/blob/2020-update/youtube.py#L56-L123
extract_flat trick https://github.com/benjaoming/sushi-chef-ubongokids/blob/2020-update/sushichef.py#L75 (get the urls of the videos in the playlist without actually downloading)
scrape all playlists for a user https://github.com/learningequality/sushi-chef-khan-academy/blob/master/sushichef.py#L401-L441
Brand new cache of info json PR https://github.com/learningequality/ricecooker/pull/278/files

ivanistheone · 2020-07-13T03:24:06Z

The YouTubeResource class is currently limited in its ability to process playlists and channels/usernames. However, the functionality for videos has proven very useful (robust to all kinds of errors and with support for proxy servers). Recently PR#278 was opened, which provides additional caching functionality.

It is time to revisit the functionality in pressurecooker.youtube to implement some general purpose scraper that all chef code can use.

Requirements

robust to all errors and exceptions
maintain proxy functionality (rotate to a new proxy server on network errors such as HTTP 429)
maintain backward compatibility of YouTubeResource for existing chefs (only used for videos)
add support for caching (using json files saved to filesystem)
add support for playlist > videos and channel > playlist > videos
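
For the caching requirement, here is a minimal sketch of what "json files saved to filesystem" could look like. It is not the code from PR#278: the names InfoCache and cached_info are hypothetical, and keying the files by a hash of the URL is an assumption made for illustration.

```python
import hashlib
import json
import os


class InfoCache:
    """Hypothetical filesystem cache: one JSON file per URL."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url):
        # Hash the URL so it is always safe to use as a file name.
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".json")

    def get(self, url):
        """Return the cached info dict, or None on a cache miss."""
        path = self._path(url)
        if not os.path.exists(path):
            return None
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def set(self, url, info):
        with open(self._path(url), "w", encoding="utf-8") as f:
            json.dump(info, f)


def cached_info(url, fetch, cache):
    """Call fetch(url) only when the cache has no entry for url."""
    info = cache.get(url)
    if info is None:
        info = fetch(url)
        cache.set(url, info)
    return info
```

Here fetch stands for whatever function actually talks to YouTube, so a second chef run with the same cache directory would not hit the network for URLs it has already seen.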

Design

Continue to do proxy selection and automatic proxy use based on ENV variables
Maintain data as close as possible to the "native" info dict format used by YoutubeDL (used for caching)
Allow users access to the raw info json
Add to_node functions to return data formatted for use in ricecooker
- For videos to_node returns metadata suitable for VideoNode + YouTubeVideoFile (and optionally subtitles)
- For playlists to_node returns metadata suitable for TopicNode containing VideoNode children
- For channel/playlists to_node returns a two-level topic hierarchy with VideoNode leaf nodes
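
To make the to_node idea concrete, here is a rough sketch of the mapping from a raw info dict to ricecooker-style metadata. The function names and the output keys (source_id, title, description, children) are assumptions for illustration only; the input keys id, title, description and entries are the usual fields of a YoutubeDL info dict.

```python
def video_to_node(info):
    """Map a raw video info dict to metadata for a VideoNode (sketch)."""
    return {
        "source_id": info["id"],
        "title": info.get("title") or "",
        "description": info.get("description") or "",
    }


def playlist_to_node(info):
    """Map a raw playlist info dict to topic metadata with video children."""
    return {
        "source_id": info["id"],
        "title": info.get("title") or "",
        "description": info.get("description") or "",
        # Each playlist entry becomes one child video.
        "children": [video_to_node(entry) for entry in info.get("entries") or []],
    }
```

A channel version would apply playlist_to_node to each of its playlists, which gives the two topic levels described above.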

Classes

YouTubeResource

maintain current interface for backward compatibility
implementation just calls the new YouTubeVideo
raise error if used with playlist or channel URL
get_resource_info returns json formatted for ricecooker = YouTubeVideo.to_node
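
One possible way to implement the "raise error if used with playlist or channel URL" check is sketched below. The helper names classify_youtube_url and require_video_url are hypothetical, and the URL patterns are assumptions that cover only the common cases, not every form a YouTube URL can take.

```python
from urllib.parse import parse_qs, urlparse


def classify_youtube_url(url):
    """Return 'channel', 'playlist' or 'video' for a YouTube URL (sketch)."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if parsed.path.startswith(("/channel/", "/user/", "/c/")):
        return "channel"
    # A watch URL that also carries list=... is still treated as a video.
    if parsed.path == "/playlist" or ("list" in query and "v" not in query):
        return "playlist"
    return "video"


def require_video_url(url):
    """Raise ValueError unless url looks like a single-video URL."""
    kind = classify_youtube_url(url)
    if kind != "video":
        raise ValueError("YouTubeResource only supports videos, got a %s URL: %s" % (kind, url))
    return url
```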

YouTubeBase

provides an interface similar to the underlying YoutubeDL class
handles auto proxy selection and rotation on network errors
provides robust error handling for all errors and exceptions
get_info method that returns same data as ydl.extract_info(url, download=False, process=True)
does not do any "packaging" for ricecooker (see subclasses)
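
A rough sketch of how get_info could combine proxy rotation with catch-all error handling. YouTubeBaseSketch is a hypothetical stand-in and not the existing pressurecooker code; taking the proxy list as a constructor argument (rather than from ENV variables) and returning None after max_attempts failures are assumptions. The extractor is passed in as a function so the retry logic can be exercised without network access; the default uses the youtube_dl package with its proxy option.

```python
import itertools


def youtube_dl_extract(url, options):
    """Default extractor; needs the third-party youtube_dl package."""
    import youtube_dl  # imported lazily so the sketch loads without it

    with youtube_dl.YoutubeDL(options) as ydl:
        return ydl.extract_info(url, download=False, process=True)


class YouTubeBaseSketch:
    """Hypothetical stand-in for the proposed YouTubeBase."""

    def __init__(self, url, proxies=(), extract=youtube_dl_extract, max_attempts=3):
        self.url = url
        self.extract = extract
        self.max_attempts = max_attempts
        # Cycle through the proxies; None means a direct connection.
        self._proxies = itertools.cycle(list(proxies) or [None])
        self.errors = []

    def get_info(self, options=None):
        """Return the raw info dict, or None if every attempt failed."""
        for _ in range(self.max_attempts):
            opts = dict(options or {})
            proxy = next(self._proxies)
            if proxy:
                opts["proxy"] = proxy
            try:
                return self.extract(self.url, opts)
            except Exception as exc:  # deliberately broad so the chef keeps running
                self.errors.append((proxy, exc))
        return None
```

Each failed attempt moves on to the next proxy, and the collected errors stay available on the instance for logging.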

YouTubeVideo(YouTubeBase)

method to_ricecooker_node returns metadata suitable for VideoNode + YouTubeVideoFile
add get_subtitle_languages method (see here), including caching
implements download method (in case chef needs direct access to video files)
(future) to_studio_node returns metadata required to create Studio ContentNode

Example usage to download video_url and all available subs:

yt_vid = YouTubeVideo(url=video_url)
vid_metadata = yt_vid.to_ricecooker_node()
vid_node = VideoNode(**vid_metadata)
# lang=... is a placeholder: the proposal leaves the video language open
vid_node.add_file(YouTubeVideoFile(youtube_id=vid_metadata['id'], lang=...))
lang_codes = yt_vid.get_subtitle_languages()
for lang_code in lang_codes:
    vid_node.add_file(YouTubeSubtitleFile(youtube_id=vid_metadata['id'], lang=lang_code))

YouTubePlaylist(YouTubeBase)

We don't want to call YouTubeBase directly on a playlist URL because this results in O(n) calls to the YouTube API and can lead to requests being blocked
Instead, use a "lightweight" playlist downloader based on extract_flat and calls to YouTubeVideo
to_ricecooker_node method returns metadata suitable for TopicNode containing VideoNode children
(future) to_studio_node returns metadata required to create Studio ContentNode
(optional) download method that downloads all videos in playlist to a folder
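
The "lightweight" step could look roughly like this: fetch the playlist once with the "extract_flat": True option, then turn the flat entries into video URLs that are each handed to a separate YouTubeVideo. The helper name flat_video_urls is hypothetical, and the entry shape (a dict whose id or url field holds the video ID) is an assumption about what the flat extraction returns.

```python
WATCH_URL = "https://www.youtube.com/watch?v="


def flat_video_urls(playlist_info):
    """Build watch URLs from the entries of a flat playlist info dict."""
    urls = []
    for entry in playlist_info.get("entries") or []:
        # Assumption: flat entries carry the video ID in 'id' or 'url'.
        video_id = entry.get("id") or entry.get("url")
        if video_id:
            urls.append(WATCH_URL + video_id)
    return urls
```

Because the flat call does not resolve each video, the per-video requests can then be made one at a time, each with its own proxy.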

Example usage to download playlist_url:

yt_pl = YouTubePlaylist(url=playlist_url)
pl_metadata = yt_pl.to_ricecooker_node(options={"extract_flat": True})
video_urls = pl_metadata.pop('children')
topic_node = TopicNode(**pl_metadata)
for video_url in video_urls:
    yt_vid = YouTubeVideo(url=video_url)
    vid_metadata = yt_vid.to_ricecooker_node()
    vid_node = VideoNode(**vid_metadata)
    # lang=... is a placeholder: the proposal leaves the video language open
    vid_node.add_file(YouTubeVideoFile(youtube_id=vid_metadata['id'], lang=...))
    topic_node.add_child(vid_node)

YouTubeChannel(YouTubeBase)

to_ricecooker_node returns info required to create the TopicNode for that channel
get_playlists_flat returns the list of URLs of all playlists for that YouTube channel (or username)
see example code, but note this code is problematic since it results in thousands of YouTube calls using the same proxy server. This is why we need to replace it with calls that use the "extract_flat":True option and separate calls to YouTubeBase/YoutubeDL so that each request gets assigned a new proxy server.

Example usage, to download the videos from all the playlists of the youtube user KhanAcademyKiswahili, run:

channel_node = Channel(name="KA Swahili", source_id=...)  # remaining channel fields omitted
yt_ch = YouTubeChannel(id="KhanAcademyKiswahili")
playlist_urls = yt_ch.get_playlists_flat()  # == get_info(options={"extract_flat": True})['entries']
for playlist_url in playlist_urls:
    yt_pl = YouTubePlaylist(url=playlist_url)
    pl_metadata = yt_pl.to_ricecooker_node(options={"extract_flat": True})
    video_urls = pl_metadata.pop('children')
    topic_node = TopicNode(**pl_metadata)
    for video_url in video_urls:
        yt_vid = YouTubeVideo(url=video_url)
        vid_metadata = yt_vid.to_ricecooker_node()
        vid_node = VideoNode(**vid_metadata)
        # lang=... is a placeholder: the proposal leaves the video language open
        vid_node.add_file(YouTubeVideoFile(youtube_id=vid_metadata['id'], lang=...))
        topic_node.add_child(vid_node)
    channel_node.add_child(topic_node)

@WenyuZhang1992 ^ note the usage examples above refer to code that doesn't exist yet; this is just my proposal for classes and methods that would handle all the use cases and would be easy to use in chef code. This is a kind of readme-driven-programming ;)

rtibbles transferred this issue from learningequality/pressurecooker May 11, 2021
