yt_metadata.md

Download YouTube metadata & subtitles:

Usage

To extract metadata from YouTube videos, add a yt_metadata_args entry under the yt_args key of the reading specifications in your config, as follows:

yt_args:
    download_size: 360
    download_audio_rate: 44100
    yt_metadata_args:
        writesubtitles: 'all'
        subtitleslangs: ['en']
        writeautomaticsub: True
        get_info: True

Additionally, if you set captions_are_subtitles to true in the storage parameters, each video and audio sample is clipped according to the subtitles and split into many separate samples.
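As a quick sanity check of the nesting described above, the config can be mirrored as a plain Python dict. This is only a sketch: the top-level key names `reading` and `storage` and the two helper functions are assumptions made for illustration, not part of the tool's API.

```python
# Hypothetical mirror of the config; the top-level "reading" and
# "storage" key names are assumptions based on the text above.
config = {
    "reading": {
        "yt_args": {
            "download_size": 360,
            "download_audio_rate": 44100,
            "yt_metadata_args": {
                "writesubtitles": "all",
                "subtitleslangs": ["en"],
                "writeautomaticsub": True,
                "get_info": True,
            },
        },
    },
    "storage": {"captions_are_subtitles": True},
}


def metadata_enabled(cfg):
    """Return True when a non-empty yt_metadata_args sits under reading.yt_args."""
    yt_args = cfg.get("reading", {}).get("yt_args", {})
    return bool(yt_args.get("yt_metadata_args"))


def clips_by_subtitles(cfg):
    """Return True when the storage section asks for subtitle-based clipping."""
    return bool(cfg.get("storage", {}).get("captions_are_subtitles", False))


print(metadata_enabled(config), clips_by_subtitles(config))
```

If `metadata_enabled` returns False for your config, the yt_metadata_args block is probably nested at the wrong level.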

Output

For every sample, the metadata is written to the JSON file as follows:

{
    "description": "For the past five years, King Fish has been creating a media channel for IBM to generate leads of senior IT decision makers and retain current customers.  We produce dozens of webcasts every year for numerous divisions within IBM. King Fish provides managed services, original content and audience development. \n\nKFM worked with IBM to develop video content on how SPSS Statistics can help their clients meet business goals with advanced data insight methods. The result? Much more effective than an info-graphic.",
    "videoID": "QW3-5OuWn4M",
    "start": 56.1025,
    "end": 66.10249999999999,
    "caption": "IBM SPSS",
    "url": "http://youtube.com/watch?v=QW3-5OuWn4M",
    "key": "000000_00001",
    "status": "success",
    "error_message": null,
    "yt_meta_dict": {
        "info": {
            "id": "QW3-5OuWn4M",
            "title": "IBM SPSS",
            "thumbnail": "https://i.ytimg.com/vi/QW3-5OuWn4M/maxresdefault.jpg",
            "description": "For the past five years, King Fish has been creating a media channel for IBM to generate leads of senior IT decision makers and retain current customers.  We produce dozens of webcasts every year for numerous divisions within IBM. King Fish provides managed services, original content and audience development. \n\nKFM worked with IBM to develop video content on how SPSS Statistics can help their clients meet business goals with advanced data insight methods. The result? Much more effective than an info-graphic.",
            "uploader": "King Fish Media",
            "uploader_id": "KingFishMediaBoston",
            "uploader_url": "http://www.youtube.com/user/KingFishMediaBoston",
            "channel_id": "UCDy7Xb5vYxbmSosQmztCCcQ",
            "channel_url": "https://www.youtube.com/channel/UCDy7Xb5vYxbmSosQmztCCcQ",
            "duration": 122,
            "view_count": 116,
            "average_rating": null,
            "age_limit": 0,
            "webpage_url": "https://www.youtube.com/watch?v=QW3-5OuWn4M",
            "categories": [
                "Science & Technology"
            ],
            "tags": [
                "IBM",
                "technology",
                "statistics",
                "data",
                "analysis",
                "computers",
                "content marketing",
                "Software"
            ],
            "playable_in_embed": true,
            "live_status": "not_live",
            "release_timestamp": null,
            "comment_count": null,
            "chapters": null,
            "like_count": 1,
            "channel": "King Fish Media",
            "channel_follower_count": 10,
            "upload_date": "20131107",
            "availability": "public",
            "original_url": "http://youtube.com/watch?v=QW3-5OuWn4M",
            "webpage_url_basename": "watch",
            "webpage_url_domain": "youtube.com",
            "extractor": "youtube",
            "extractor_key": "Youtube",
            "playlist": null,
            "playlist_index": null,
            "display_id": "QW3-5OuWn4M",
            "fulltitle": "IBM SPSS",
            "duration_string": "2:02",
            "is_live": false,
            "was_live": false,
            "requested_subtitles": {
                "en": {
                    "ext": "vtt",
                    "url": "https://www.youtube.com/api/timedtext?v=QW3-5OuWn4M&caps=asr&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1676200746&sparams=ip%2Cipbits%2Cexpire%2Cv%2Ccaps%2Cxoaf&signature=A43F4C223A9DBC7E3BFBC61027FC5AF70D709AB5.B386EB52DD412DEFC3E8DBBCF7F30C442473CDA4&key=yt8&kind=asr&lang=en&fmt=vtt",
                    "name": "English"
                }
            },
            "_has_drm": null,
            "format": "137 - 1920x1080 (1080p)+251 - audio only (medium)",
            "format_id": "137+251",
            "ext": "mkv",
            "protocol": "https+https",
            "language": null,
            "format_note": "1080p+medium",
            "filesize_approx": 12831366,
            "tbr": 841.009,
            "width": 1920,
            "height": 1080,
            "resolution": "1920x1080",
            "fps": 30,
            "dynamic_range": "SDR",
            "vcodec": "avc1.640028",
            "vbr": 691.069,
            "stretched_ratio": null,
            "acodec": "opus",
            "abr": 149.94,
            "asr": 48000,
            "audio_channels": 2
        }
    },
    "clips": [
        [
            "00:00:07.749",
            "00:00:07.759"
        ]
    ] 
}

Because captions_are_subtitles was set, the txt file contains the subtitle text for the given clip. For this particular example it would be: "analytics to assess performance based on"
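To make the layout of the per-sample JSON concrete, here is a minimal sketch that loads a trimmed copy of the example above and pulls out a few fields. The `summarize` helper and the names of its output keys are hypothetical; only the input field names come from the example.

```python
import json

# Trimmed copy of the example sample shown above.
sample_json = """
{
    "videoID": "QW3-5OuWn4M",
    "caption": "IBM SPSS",
    "status": "success",
    "yt_meta_dict": {
        "info": {
            "title": "IBM SPSS",
            "uploader": "King Fish Media",
            "duration": 122,
            "tags": ["IBM", "technology"]
        }
    },
    "clips": [["00:00:07.749", "00:00:07.759"]]
}
"""


def summarize(sample):
    """Collect a few commonly used fields (hypothetical helper)."""
    info = sample.get("yt_meta_dict", {}).get("info", {})
    return {
        "video_id": sample.get("videoID"),
        "title": info.get("title"),
        "uploader": info.get("uploader"),
        "duration_s": info.get("duration"),
        "n_clips": len(sample.get("clips", [])),
    }


summary = summarize(json.loads(sample_json))
print(summary)
```

Using `.get` with defaults keeps the helper usable for samples where metadata extraction failed and yt_meta_dict is missing.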

Multilingual Subtitles

To control the language(s) of the subtitles extracted from your videos, you can provide either 'first' or 'all' for writesubtitles (any other value that evaluates to True also behaves like 'all').

first: Extract subtitles only for the first language in subtitleslangs for which subtitles exist.
all: Attempt to extract subtitles for every language in subtitleslangs.
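The two modes can be illustrated with a small sketch. This is not the tool's own code: `select_languages` is a hypothetical function that only restates the behaviour described above, assuming the set of languages available for a video is already known.

```python
def select_languages(available, subtitleslangs, writesubtitles):
    """Return the languages to extract, per the described semantics.

    available:      languages for which the video actually has subtitles
    subtitleslangs: requested languages, in priority order
    writesubtitles: 'first', 'all', or any truthy/falsy value
    """
    if not writesubtitles:
        return []
    # Keep the requested order, dropping languages the video lacks.
    matches = [lang for lang in subtitleslangs if lang in available]
    if writesubtitles == "first":
        return matches[:1]
    # 'all' and any other truthy value behave the same way.
    return matches


requested = ["en", "es", "fr"]
print(select_languages({"es", "fr"}, requested, "first"))  # ['es']
print(select_languages({"es", "fr"}, requested, "all"))    # ['es', 'fr']
```

Note that with 'first', a video lacking English subtitles falls through to the next requested language instead of yielding nothing.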

Below are some example outputs with subtitleslangs: ['en', 'es', 'fr'].

Using writesubtitles: 'first':

{
    "url": "https://www.youtube.com/watch?v=CvHAfXKIvgw",
    ...
    "yt_meta_dict": {
        ...
        "subtitles": {
            "en": [
                {
                    "start": "00:00:02.100",
                    "end": "00:00:03.360",
                    "lines": [
                        "Good morning Lisa"
                    ]
                },
                ...
            ]
        }
    },
    "clips": [
        [
            2.1,
            3.36
        ]
    ],
    "clip_subtitles": [
        {
            "start": "00:00:02.100",
            "end": "00:00:03.360",
            "lines": {
                "en": [
                    "Good morning Lisa"
                ]
            }
        }
    ]
}

Using writesubtitles: 'all':

{
    "url": "https://www.youtube.com/watch?v=CvHAfXKIvgw",
    ...
    "yt_meta_dict": {
        ...
        "subtitles": {
            "en": [
                {
                    "start": "00:00:02.100",
                    "end": "00:00:03.360",
                    "lines": [
                        "Good morning Lisa"
                    ]
                },
                ...
            ],
            "es": [
                {
                    "start": "00:00:02.100",
                    "end": "00:00:03.360",
                    "lines": [
                        "Buenos d\u00edas Lisa"
                    ]
                },
                ...
            ],
            "fr": [
                {
                    "start": "00:00:02.100",
                    "end": "00:00:03.360",
                    "lines": [
                        "Bonjour Lisa"
                    ]
                },
                ...
            ]
        }
    },
    "clips": [
        [
            2.1,
            3.36
        ]
    ],
    "clip_subtitles": [
        {
            "start": "00:00:02.100",
            "end": "00:00:03.360",
            "lines": {
                "en": [
                    "Good morning Lisa"
                ],
                "es": [
                    "Buenos d\u00edas Lisa"
                ],
                "fr": [
                    "Bonjour Lisa"
                ]
            }
        }
    ]
}
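The clip_subtitles entries use "HH:MM:SS.mmm" strings, while clips in these examples holds seconds as floats. A minimal sketch of converting between the two and reading the text for one language follows; `timestamp_to_seconds` and `clip_text` are hypothetical helpers, and the data is copied from the 'all' example above.

```python
def timestamp_to_seconds(ts):
    """Convert an 'HH:MM:SS.mmm' string to seconds as a float."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_text(entry, lang):
    """Join the subtitle lines of one clip for a language ('' if absent)."""
    return " ".join(entry["lines"].get(lang, []))


# One clip_subtitles entry, copied from the 'all' example above.
clip_subtitles = [
    {
        "start": "00:00:02.100",
        "end": "00:00:03.360",
        "lines": {
            "en": ["Good morning Lisa"],
            "es": ["Buenos d\u00edas Lisa"],
            "fr": ["Bonjour Lisa"],
        },
    }
]

for entry in clip_subtitles:
    start = timestamp_to_seconds(entry["start"])
    end = timestamp_to_seconds(entry["end"])
    print(start, end, clip_text(entry, "fr"))
```

With writesubtitles: 'first', the lines dictionary holds a single language, so `clip_text` returns an empty string for the others.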